Sr. Site Reliability Engineer, Software Defined Computing

Toronto, Vancouver (Remote ON or BC only)

Our client's Networking team is responsible for the design, implementation, and operation of the software-defined networking (SDN) technology at the core of the product. This SDN technology provides customers with Layer 2 through Layer 7 networking features for traditional applications they’ve migrated from on-premises to the cloud. These features include network isolation, MAC and IP address management and translation, policy-based routing, and hybrid-cloud connectivity. Customers can create identical clones of their virtual data centers without any L2 or L3 modifications, and can connect them to one another and to on-premises resources. The Networking team makes this magic possible at scale in data centers across the world.


Your Role

In your role as a Site Reliability Engineer, you’ll use your skills to help instrument our systems so they can be easily built, observed, monitored, tested, and deployed at scale, and ensure services perform well for enterprise customers. Your main focus will be to ensure the platform runs smoothly and effectively and that our processes and technologies scale with the business. You’ll also work closely with your manager to identify and solve important technology and process problems and participate in the development of the team’s technical roadmap.


To be effective in this role, you’ll need proficiency with general DevOps practices and automation. Previous experience with networking technologies is a bonus but not required, and you will have opportunities for exposure and learning. You can expect to spend half of your time writing code, usually provisioning and monitoring automation improvements, bug fixes, and internal technical improvements.


Experience in high-performance networking architecture and operational troubleshooting of network issues is a bonus.

Your Skills

  • Understand that your success is measured by the reliability and performance of our service.

  • Have experience creating and scaling highly available distributed systems.

  • Experience with infrastructure and configuration management tools like Ansible, Puppet, Terraform.

  • Experience with time-series databases and data visualization tools such as the TICK Stack (Telegraf, InfluxDB, Chronograf, and Kapacitor).

  • Experience with logging, search, and visualization tools such as the Elastic Stack (Elasticsearch, Logstash, Kibana) and Grafana

  • Experience with container and orchestration tools such as Docker and Kubernetes

  • Experience with scripting

  • Experience debugging system issues such as memory leaks, using standard system tools such as bash, ps

  • Have intermediate programming experience beyond Bash.

  • Ability to dig into the details of projects or write scripts to uncover patterns in data sources.

  • Ability to remain calm and effective in high-stress settings such as interpersonal conflicts, technical debates, and production outages.

  • Detail-oriented reader. You can read a spec and see the big picture as well as missing edge cases.

  • Strong verbal and written communication skills

  • A collaborative attitude and working style.


Your Responsibilities

  • Design and add new monitoring, logging, alerting, and metrics to systems

  • Eventually contribute to the team on-call rotation

  • Plan for and coordinate hardware replacements and repairs

  • Diagnose system issues and assess overall fleet health

  • General system operations work

  • Improve configuration management systems and automation

  • Improve processes and documentation for service administration.

  • Write design documentation for major service improvements.

  • Improve and automate build and testing pipeline.

  • Run performance tests and publish results.

  • Create and improve capacity management modeling and tools

  • Ensure software release process is operating smoothly and effectively

  • Track upstream package updates and determine which updates will need to be pulled in and merged.

  • Incorporate new software into package management repos when needed.

  • Ensure networking systems are secure and compliant with organizational policies and industry standards & practices.

  • As your domain knowledge increases over time, you can take on additional responsibilities such as assisting with field requests and customer-facing troubleshooting and diagnostics, including working with the Support team.

Other Skills That Are a Plus

  • Knowledge of kernel internals and tunables.

  • Knowledge of and experience with core networking technologies such as TCP/IP.

  • Zabbix

  • MySQL