Senior Site Reliability Engineer - Fleet Reliability
💰 $80,000 – $130,000/yr
Job Description
About Lambda
Lambda is The Superintelligence Cloud, a leader in AI cloud infrastructure serving tens of thousands of customers worldwide. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. We're building the world's best AI cloud infrastructure, and we're looking for exceptional talent to join our engineering team.
Role Overview
As a Senior Site Reliability Engineer specializing in Fleet Reliability, you'll play a critical role in ensuring the stability, performance, and scalability of Lambda's cloud infrastructure. Engineering at Lambda is responsible for building and scaling our cloud offering, including the Lambda website, cloud APIs, internal systems, and tooling for deployment, management, and maintenance. You'll work at the intersection of infrastructure reliability and AI compute optimization.
What You'll Do
- Define and implement Fleet Health metrics and indicators to objectively measure and continuously improve system availability
- Collaborate with the observability team to design comprehensive monitoring and alerting systems that proactively predict, detect, and respond to issues or anomalies
- Create detailed runbooks and develop automated remediations for common failure scenarios to reduce incident response time
- Build robust automation and auditing systems to ensure compliance while improving efficiency and productivity across operations
- Participate in on-call rotations and provide expert support for incident response and resolution
- Implement and integrate advanced logging and metrics across platforms including Datadog, Prometheus, OpenTelemetry, Grafana, and Sumo Logic
Your Qualifications
- 7+ years of professional experience in Site Reliability Engineering, DevOps, or equivalent infrastructure roles
- Strong understanding of modern AI infrastructure, including GPU architectures and hardware performance optimization
- Deep expertise with Linux-based systems in distributed environments at scale
- Solid proficiency in Python and Go, with demonstrated experience collaborating with software engineering teams to improve internal tooling
- Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, and Sumo Logic
- Proficiency in automation and configuration management tools including Ansible and Terraform
- Familiarity with major cloud platforms such as OCI, AWS, GCP, or Azure
- Excellent problem-solving and troubleshooting skills, with the ability to debug complex distributed systems
- Strong communication and collaboration abilities to work effectively across engineering teams
- Passion for continuous improvement and innovation in infrastructure reliability
Nice to Have
- Experience in the machine learning or computer hardware industry
- Knowledge of containerization and orchestration technologies such as Docker and Kubernetes
- Track record of building scalable infrastructure for high-performance computing environments
Work Arrangement
This position requires working from our San Francisco office 4 days per week. Lambda's designated work-from-home day is currently Tuesday. Please note: This is not a fully remote position.
💰 Compensation not publicly listed. Market estimate for similar roles: from $80K, varying by experience and location.