
RBC
Lead Software Engineer (Contract, Site Reliability Engineering)
July 2018 to March 2021 · Toronto, ON, Canada
The Royal Bank of Canada (RBC) is the largest bank in Canada by asset value, with over 80,000 employees worldwide.
Highlights
- Joined during the formation of the SRE team (six months old at the time) to build visibility and reporting infrastructure.
- Designed and developed Global Dashboard, a bespoke reporting platform providing secure, filtered visibility into operational data across the bank.
- Ensured data isolation to prevent sensitive information from leaking outside the SRE team while enabling safe, organization-wide access to key metrics.
- Reduced incident-response time by streamlining access to critical operational insights and eliminating the need for manual log filtering.
- Flagged and resolved a critical data-management flaw in the “Seven-Year Transaction History” project, preventing multi-month delays and the scrapping of nationwide marketing campaigns. This enabled RBC to become the first Canadian bank to offer customers complete seven-year transaction visibility.
- Operated with a high degree of independence and trust, identifying opportunities and building solutions proactively to support SRE and internal stakeholders.
- Collaborated directly with internal customers to prioritize and deliver new dashboard features, effectively acting as technical product owner for the platform.
- Created complementary tools including an anomaly-detection model (scikit-learn) and an internal chatbot that reduced time spent responding to customer inquiries.
Application Development & Custom Tooling
SRE Global Dashboard, 2019–2021
The SRE Global Dashboard is a collection of dashboards that were designed to be accessed by everyone within RBC. Directors, product managers and engineers rely on this data to monitor the availability of more than 600 APIs and applications.
I conceived this project after we (the SRE team) exhausted the capabilities of the Elastic Kibana platform. Kibana is great for rapid visualization of data, but it has significant limitations in the types of reporting it can do. I designed and built a stand-alone application to handle these reports, which gave us more control over the way data is presented.
- Interactive series of dashboards for monitoring infrastructure. Know what services are working and how well they’re performing.
- An organized, real-time view of hundreds of servers across multiple data centers. This was an essential tool during a zero-downtime system upgrade, as it allowed the SRE team to be sure that traffic was routed correctly.
- Real-time monitoring of critical infrastructure for which the SRE team holds an SLA requiring five-nines (99.999%) availability for the bank.
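For context, the downtime budget implied by a five-nines target can be worked out directly. The sketch below is illustrative arithmetic only; the function name and the 365-day year are assumptions, not details of the RBC SLA.

```python
def downtime_budget_seconds(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the maximum downtime, in seconds, allowed over the period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    yearly = downtime_budget_seconds(99.999)
    print(f"Five nines allows about {yearly / 60:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% availability leaves roughly 5.26 minutes of downtime per year, which is why automated, real-time monitoring matters at this level.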
Seven Years of Transaction Data, 2019–2021
Before this software project began, I fought for certain architectural changes that would ensure Elasticsearch could function at scale. The project was already well underway before I got involved; it was built by a team other than my own, and I was relatively new at RBC. It was an uphill battle right from the start, but I prevailed. The changes I proposed were later implemented, and they increased the performance and reliability of the service.
- Recommended changes to existing infrastructure that increased reliability of the API
- Built a script that reconciled over 21 billion financial transactions to ensure correctness
- Designed self-healing tools that would automatically respond to known failure modes
- Built tools that automate monthly maintenance activities, reducing the time engineers spend on maintenance by 12 hours each month
- Built a functional prototype that optimizes Elasticsearch queries to prevent over 4,000 failed search queries each month
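As an illustration of what reconciliation at this scale can look like, the sketch below compares two sets of transactions day by day using a record count and an order-independent digest. It is a hypothetical approach, not the script used at RBC; the tuple layout, function names, and sample data are all assumptions.

```python
import hashlib
from collections import defaultdict


def bucket_digests(transactions):
    """Return {date: (count, digest)} with an order-independent digest per day.

    transactions: iterable of (transaction_id, date, amount_cents) tuples.
    """
    buckets = defaultdict(lambda: [0, 0])
    for txn_id, date, amount_cents in transactions:
        record = f"{txn_id}|{date}|{amount_cents}".encode("utf-8")
        record_hash = int.from_bytes(hashlib.sha256(record).digest()[:8], "big")
        buckets[date][0] += 1
        buckets[date][1] ^= record_hash  # XOR makes the digest order-independent
    return {date: (count, digest) for date, (count, digest) in buckets.items()}


def mismatched_days(source, indexed):
    """Return the sorted dates whose count or digest differs between the two sets."""
    a, b = bucket_digests(source), bucket_digests(indexed)
    return sorted(day for day in set(a) | set(b) if a.get(day) != b.get(day))


if __name__ == "__main__":
    # Hypothetical sample data: one amount differs on 2020-01-02.
    source = [("t1", "2020-01-01", 1000), ("t2", "2020-01-01", 250), ("t3", "2020-01-02", 999)]
    indexed = [("t2", "2020-01-01", 250), ("t1", "2020-01-01", 1000), ("t3", "2020-01-02", 990)]
    print(mismatched_days(source, indexed))
```

Comparing per-day summaries first keeps the expensive record-by-record comparison limited to the days that actually disagree.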
Chatbot, 2019–2021
The SRE team receives dozens of emails every day, many of them from internal customers asking the same general questions about their own specific APIs. These customers expected an immediate answer, but the team was not large enough to provide one.
RBC’s official chatbot vendor (the one you'll use on rbc.com) could only respond with pre-composed messages and was not capable of providing real-time answers from different sources.
I built the chatbot from the ground up to respond to these questions, sourcing its answers from a variety of APIs so that responses included real-time metrics.
- Built a chatbot from scratch to help the team spend less time answering common questions from internal customers
- Uses TF-IDF for token vectorization and named-entity recognition (NER) from spaCy
- Answers questions using real-time data
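The TF-IDF matching step can be sketched roughly as follows. This is a minimal, plain-Python illustration and not the chatbot's actual code: the intent names, example phrasings, and function names are hypothetical, cosine similarity is an assumed scoring choice, and the spaCy NER step is left out.

```python
import math
import re
from collections import Counter

# Hypothetical intents with example phrasings; not taken from the real chatbot.
INTENT_EXAMPLES = {
    "service_status": "is the service down or unavailable right now",
    "throughput": "what is the current tps transactions per second for the api",
    "error_rate": "how many errors or failed requests is the api returning",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(documents):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Map tokens to TF-IDF weights, ignoring tokens unseen in the corpus."""
    counts = Counter(tok for tok in tokens if tok in idf)
    return {tok: count * idf[tok] for tok, count in counts.items()}


def cosine(a, b):
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(question, intent_examples=INTENT_EXAMPLES):
    """Return (best_intent, score), or (None, 0.0) if nothing matches."""
    names = list(intent_examples)
    docs = [tokenize(intent_examples[name]) for name in names]
    idf = build_idf(docs)
    query = tfidf_vector(tokenize(question), idf)
    scores = [(cosine(query, tfidf_vector(doc, idf)), name) for doc, name in zip(docs, names)]
    best_score, best_name = max(scores)
    return (best_name, best_score) if best_score > 0.0 else (None, 0.0)


if __name__ == "__main__":
    print(classify("What's the TPS for payments?"))
```

In a design like this, the matched intent would select which metrics API to query, and an entity extractor would supply the service name to look up.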
OAuth Client Creator for RBC Capital Markets, 2020–2021
PingFederate’s API system does not support the access control layers needed to control who can manage OAuth client IDs. I built a system that provides this level of control on top of PingFederate’s existing core APIs.
This application allowed RBC to:
- Streamline creation and approval of OAuth clients, minimizing errors
- Add an access control layer to our PingFederate infrastructure
- Improve logging and record-keeping of each change for auditors
Machine Learning / Artificial Intelligence
I used machine learning libraries in Python to automate some of the activities supported by the SRE team. Many of these uses are described above, but some of the key highlights include:
- Developed a chatbot from scratch to answer common questions and provide answers using real-time infrastructure metrics (e.g. “Is service [x] down?” or “What’s the TPS for [thing]?”)
- Used NLP via TF-IDF to automatically label reports about previous incidents to help identify areas for improvement
- Developed a feature-embedding model that optimizes disk space and makes it possible to store years’ worth of historical logs while preserving entropy. Terabytes of data were reduced to mere gigabytes; without this embedding, we were forced to purge the data after a short period of time.
Budgeting
Cost Chargeback Model, 2019–2021
In 2020, the SRE team moved to a chargeback costing model. Costs for the services supported by the team were often a mix of direct and shared costs. The pricing model I designed used API request logs to weigh the fixed and variable costs associated with each service, then provided a total service cost to each business unit. The results from this model were updated in real time and displayed in the SRE Global Dashboard.
- Developed a chargeback model that weighs department costs to internal customers
- Costing reports were published in the SRE Global Dashboard
- Provided a solution that was easily understood and explainable to stakeholders
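The core idea of weighting shared costs by API usage can be sketched as below. This is a simplified illustration under stated assumptions, not the model that was deployed: the function name, the proportional-by-request-count rule, and the figures are all hypothetical.

```python
from collections import Counter


def allocate_costs(request_log, shared_cost, direct_costs=None):
    """Split shared_cost across business units in proportion to request volume,
    then add each unit's direct costs.

    request_log: iterable of (business_unit, request_count) pairs.
    """
    direct_costs = direct_costs or {}
    requests = Counter()
    for unit, count in request_log:
        requests[unit] += count
    total_requests = sum(requests.values())
    if total_requests == 0:
        raise ValueError("request_log contains no requests to weight by")

    totals = {}
    for unit in set(requests) | set(direct_costs):
        # Counter returns 0 for units that have direct costs but no requests.
        share = shared_cost * requests[unit] / total_requests
        totals[unit] = round(share + direct_costs.get(unit, 0.0), 2)
    return totals


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    log = [("retail", 600), ("capital_markets", 300), ("retail", 100)]
    print(allocate_costs(log, shared_cost=10_000.0, direct_costs={"capital_markets": 500.0}))
```

A proportional rule like this is easy to explain to stakeholders: each unit pays its direct costs plus a share of the shared costs equal to its share of traffic.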
Events
I frequently presented the projects I worked on at our bi-weekly town hall. In 2019, I also hosted an interactive workshop on machine learning, in partnership with Elastic, for Catalyst, an RBC-specific event that took place during the week of Toronto’s popular technology event, Collision.
10KC
Jul 24, 2020 · Everyday NLP
Catalyst/Collision
Sep 2020 · Rapid Automation by SRE
Sep 25, 2019 · Anomaly Detection with Elasticsearch: Talk + Workshop with step-by-step guide handout and a demo booth featuring our chatbot
Town Hall
May 13, 2020 · Seven-Year Transaction History Automation
Mar 25, 2020 · Introducing the SRE Chargeback Model
Feb 2 & 27, 2020 · Five-Nines Monitoring and Automation
Jan 31, 2020 · Monitoring Five-Nines Availability
Jan 20, 2020 · How the SRE Team Uses Machine Learning To Monitor 1000s of APIs
Dec 13, 2019 · Monitoring Seven Years of Business & Personal Transaction History
Dec 2, 2019 · SRE Automation with Machine Learning
Oct 10, 2019 · Using Supervised Learning to Detect Anomalies in API Requests
Aug 23, 2019 · Introducing the SRE Chatbot: Why we built it & how it works
Jan 25, 2019 · How to use machine learning in Elasticsearch
Jan 11, 2019 · How to Build Effective ELK Watchers
Dec 7, 2018 · Automating PingFederate Client Promotions
Nov 9 & 23, 2018 · Validating Seven Years of Business Transaction History
Publications
Jun 11, 2020. “Using Natural Language Processing to Analyze Enterprise-Wide Incident Reports”
May 1, 2020. “Building Custom Tools to Automate Seven Years of Transaction History”
Mar 15, 2020. “Using Supervised Learning to Detect Anomalies in API Requests”
Active Research
- Optimized feature embedding that reduces terabytes of log data to a few gigabytes for efficient machine learning
Other Activities
- Coordinating & delegating development/deployment strategies
- Elasticsearch document/query optimization
- Data schema planning
- Built an ELK Heartbeat in Go
- Built a framework for simplifying the design/test/deployment of ELK Watchers
- Wrote Grok patterns (Elastic’s regular-expression-based pattern syntax) to parse server logs during data ingestion
- Kibana dashboards and reporting
- Code reviews for Java