We are seeking a proactive Site Reliability Engineer (SRE) to drive reliability, performance, and efficiency across our systems and platforms. You'll work closely with Application Development, QA, Product, and Data Engineering teams to champion a DevOps/SRE culture rooted in automation, observability, and continuous improvement.
Key Responsibilities:
- Collaborate cross-functionally to promote SRE and DevSecOps best practices across the organization.
- Build and maintain reliable, scalable systems with a focus on availability, performance, and resiliency.
- Establish and monitor SLOs/SLIs, and develop comprehensive dashboards to support decision-making from both technical and business perspectives.
- Lead efforts to reduce toil through automation, self-healing systems, and advanced monitoring (e.g., synthetic monitoring and real user monitoring, or RUM).
- Apply observability and reliability testing practices from architecture through operations, leveraging Agile and product-based models.
- Drive the adoption of cutting-edge tools in observability, automation, platform engineering, AIOps, and MLOps.
- Contribute to and lead Communities of Practice (CoP) and SRE Office Hours to foster knowledge sharing and continuous improvement.
Qualifications:
SRE & DevOps Expertise:
- Strong experience in observability, toil reduction, incident response, and performance optimization.
- Proficient with monitoring tools such as Dynatrace, CloudWatch, and Azure Monitor.
- Skilled in Infrastructure as Code (IaC), Configuration as Code (CaC), JSON, and scripting with Python, Node.js, Ruby, PowerShell, and Shell.
- Deep understanding of Dynatrace advanced features: DT Guardian, RUM, Synthetic Monitoring, AI-based event correlation.
Cloud & Automation:
- Expert in AWS Cloud services: CDK, Lambda, CloudWatch, EKS, EC2, ELB, S3, SSM.
- Experience with log ingestion pipelines (AWS Firehose, Dynatrace OpenPipeline) and operational dashboards.
- Hands-on experience with Ansible Tower, AWS SSM, Bitbucket/GitHub, and CI/CD workflows.
Orchestration & Data:
- Familiarity with orchestration tools like Step Functions, Apache Airflow, and container platforms.
- Knowledge of data pipelines, data lakes, and databases (Redshift, RDS, Aurora, PostgreSQL, SQL Server, Oracle).
Leadership & Communication:
- Strong problem-solving and knowledge management skills.
- Effective communicator who bridges technical and business teams.
- Collaborative, inclusive leader who builds high-performing teams and fosters a culture of growth and recognition.
We’re an equal opportunity employer committed to increasing diversity and inclusion in today’s workforce. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. Minorities, women, LGBTQ candidates, and individuals with disabilities are encouraged to apply. If you require an accommodation, please review our accessibility policy and reach out to our accessibility officer with any questions.