Staff Software Engineer, Site Reliability Engineer (SRE)

Harvey · AI Frontier · San Francisco, CA · Engineering

This is a Staff Software Engineer, Site Reliability Engineer (SRE) role at Harvey, a company transforming legal and professional services with AI. The role focuses on the reliability, scalability, and performance of the legal AI platform: designing, implementing, and managing infrastructure; leading incident management; automating operational tasks; establishing best practices for security and compliance; and optimizing infrastructure costs. It requires 10+ years of experience in SRE or similar roles; expertise in IaC tools, observability tools, and cloud platforms; strong programming skills; and a solid understanding of CI/CD, Kubernetes, and cloud security.

What you'd actually do

  1. Design, implement, and manage monitoring, alerting, and infrastructure resources (compute, storage, networking) across 50+ global regions
  2. Lead incident management processes, including postmortems, root cause analyses, and driving actionable improvements
  3. Automate operational tasks and workflows, building tools and processes for capacity planning, graceful rollouts, and safe data access to maintain high reliability and reduce manual intervention
  4. Establish best practices for security, compliance, and reliability and collaborate across teams to drive these principles throughout the software lifecycle
  5. Optimize infrastructure costs through strategic capacity planning and build-versus-buy decisions while maintaining system performance, reliability, and functionality

Skills

Required

  • Infrastructure as Code (IaC) tools (Pulumi, Terraform, CloudFormation, etc.)
  • Observability tools (Datadog, Sentry, etc.)
  • Incident response practices (PagerDuty, incident.io, etc.)
  • Cloud infrastructure platforms (Azure, GCP, AWS, etc.)
  • Programming skills (Python, Bash, Go, or similar languages)
  • Diagnosing complex system problems
  • CI/CD
  • Kubernetes
  • Containerization
  • Networking
  • Databases
  • Cloud security principles

Nice to have

  • Mentorship and technical leadership

What the JD emphasized

  • 10+ years of experience in Site Reliability Engineering or similar roles supporting production environments
  • Expertise in infrastructure as code (IaC) tools (Pulumi, Terraform, CloudFormation, etc.)
  • Deep familiarity with observability tools (Datadog, Sentry, etc.) and incident response practices (PagerDuty, incident.io, etc.)
  • Proficiency with cloud infrastructure platforms (Azure, GCP, AWS, etc.)
  • Strong programming skills (Python, Bash, Go, or similar languages)
  • Proven track record of diagnosing complex system problems and implementing durable solutions
  • Solid understanding of CI/CD, Kubernetes, containerization, networking, databases, and cloud security principles