Site Reliability Engineer (SRE) 

Our client is seeking an experienced Site Reliability Engineer (SRE) to join the team. The ideal candidate will have a strong background in cloud technologies, specifically Azure, and a deep understanding of SRE practices. As a key team member, you will help ensure the reliability, scalability, and performance of customer systems. This role will also contribute to automating SDLC processes to keep operations running smoothly. We value and encourage diversity in the workplace. Women, minorities, and veterans are strongly encouraged to apply. Thanks!

Location: Hybrid (Portland, OR area)

Type: Permanent

Responsibilities: 

  • Play a crucial role in ensuring the availability, latency, performance, efficiency, and stability of critical infrastructure, supporting a range of data platforms, applications, and services.
  • Collaborate closely with development teams to implement and maintain reliable and scalable systems.
  • Proactively monitor and identify potential issues that could impact the availability of our systems.
  • Implement and maintain automated alerting mechanisms to notify the appropriate parties of potential outages or performance degradation.
  • Collaborate with development teams to design and implement solutions that enhance system resilience and reduce downtime.
  • Optimize resource utilization and minimize unnecessary expenditure on IT infrastructure.
  • Collaborate with development teams to optimize resource allocation for new applications and services.
  • Participate in the release planning process to ensure that software releases are conducted smoothly and without disruptions.
  • Design, implement, and maintain a comprehensive monitoring infrastructure to track the health and performance of our systems.
  • Collaborate across broad groups within large IT organizations to deliver results.
  • Expect project-based work with multiple external customers.

Qualifications: 

  • Experience architecting, designing and/or implementing solutions with Azure cloud tooling
  • Experience with cloud infrastructure and tooling such as Kubernetes (AKS), Docker, CI/CD pipelines, Pulumi, Terraform
  • Ability to read and write .NET (C#) code
  • Experience with Cosmos DB and SQL Server
  • Experience administering Linux operating systems
  • 5 years of experience in site reliability, including debugging, diagnosing, and correcting errors and resolving high-severity incidents
  • Experience configuring and managing monitoring and alerting tools on Azure cloud infrastructure
  • Strong background in networking and configuration of cloud networks