Senior Site Reliability Engineer - Fleet Reliability

Lambda · April 6, 2026

🌍 Hybrid · San Francisco, California, USA · Full-time

💰 $80,000 – $130,000/yr

Site Reliability Engineering · Prometheus · Terraform · Python · Go · Kubernetes · GPU Infrastructure · Datadog

Job Description

About Lambda

Lambda is The Superintelligence Cloud, a leader in AI cloud infrastructure serving tens of thousands of customers worldwide. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. We're building the world's best AI cloud infrastructure, and we're looking for exceptional talent to join our engineering team.

Role Overview

As a Senior Site Reliability Engineer specializing in Fleet Reliability, you'll play a critical role in ensuring the stability, performance, and scalability of Lambda's cloud infrastructure. Engineering at Lambda is responsible for building and scaling our cloud offering, including the Lambda website, cloud APIs, internal systems, and tooling for deployment, management, and maintenance. You'll work at the intersection of infrastructure reliability and AI compute optimization.

What You'll Do

Define and implement Fleet Health metrics and indicators to objectively measure and continuously improve system availability
Collaborate with the observability team to design comprehensive monitoring and alerting systems that proactively predict, detect, and respond to issues or anomalies
Create detailed runbooks and develop automated remediations for common failure scenarios to reduce incident response time
Build robust automation and auditing systems to ensure compliance while improving efficiency and productivity across operations
Participate in on-call rotations and provide expert support for incident response and resolution
Implement and integrate advanced logging and metrics across platforms including Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

Your Qualifications

7+ years of professional experience in Site Reliability Engineering, DevOps, or equivalent infrastructure roles
Strong understanding of modern AI infrastructure, including GPU architectures and hardware performance optimization
Deep expertise with Linux-based systems in distributed environments at scale
Solid proficiency in Python and Go, with demonstrated experience collaborating with software engineering teams to improve internal tooling
Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, and SumoLogic
Proficiency in automation and configuration management tools including Ansible and Terraform
Familiarity with major cloud platforms such as OCI, AWS, GCP, or Azure
Excellent problem-solving and troubleshooting skills, with the ability to debug complex distributed systems
Strong communication and collaboration abilities to work effectively across engineering teams
Passion for continuous improvement and innovation in infrastructure reliability

Nice to Have

Experience in the machine learning or computer hardware industry
Knowledge of containerization and orchestration technologies such as Docker and Kubernetes
Track record of building scalable infrastructure for high-performance computing environments

Work Arrangement

This position requires working from our San Francisco office 4 days per week. Lambda's designated work-from-home day is currently Tuesday. Please note: this is not a fully remote position.

💰 Compensation not publicly listed. Market estimate for similar roles: from $80K, varying by experience and location.