Software Engineer, Site Reliability (SRE)

Sierra · AI Frontier · San Francisco, CA · Engineering

Sierra is seeking a Software Engineer, Site Reliability (SRE) to build and maintain the reliability, observability, and scalability of its AI-driven infrastructure. The role involves owning the observability stack, designing scalable cloud infrastructure on AWS using Terraform, improving LLM deployments for performance and cost-effectiveness, and leading improvements in CI/CD and incident management. The SRE will define foundational SRE practices and help shape engineering culture.

What you'd actually do

  1. Own Sierra’s observability stack—monitoring, alerting, logging, and tracing—to give engineers clear visibility into system health and performance.
  2. Partner with product and platform engineers to design systems that are reliable and scalable from day one—not as an afterthought.
  3. Design and implement scalable, reliable, and secure cloud infrastructure (AWS) using Terraform and modern DevOps tooling.
  4. Improve the reliability and scalability of our LLM deployments, ensuring robust, performant, and cost-effective operation.
  5. Lead improvements to deployment pipelines, CI/CD tooling, and incident management processes to reduce downtime and response time.

Skills

Required

  • 5+ years of hands-on experience in Site Reliability or Infrastructure engineering roles for complex SaaS or cloud-based systems.
  • Experience designing for availability, scalability, and reliability at both infrastructure and application layers.
  • Deep experience with Terraform, AWS services, container orchestration, and cloud networking (including IAM and VPC architecture).
  • Strong background in observability systems (e.g., Prometheus, Grafana, Datadog, or similar).
  • Experience working with enterprise customers, including familiarity with their compliance, networking, and integration requirements.
  • Comfortable working in fast-moving environments and collaborating across product, ML, and core engineering teams.
  • Degree in Computer Science or a related field, or equivalent professional experience.

Nice to have

  • Experience with LLM infrastructure — optimizing inference performance, managing fine-tuned models, or large-scale model deployment.
  • Past experience in an early-stage startup environment, especially defining SRE culture and tooling from scratch.
  • Familiarity with incident management automation or self-healing infrastructure patterns.