Senior Software Engineer, Site Reliability Engineer (SRE)

Harvey · AI Frontier · San Francisco, CA · Engineering

This role is for a Senior Software Engineer on the Site Reliability team at Harvey, responsible for the reliability, scalability, and performance of the company's legal AI platform. Responsibilities include designing and managing infrastructure, leading incident management, automating operational tasks, and optimizing costs. The role requires expertise in infrastructure as code (IaC), observability tools, and cloud platforms, plus strong programming skills in a language such as Python, Bash, or Go.

What you'd actually do

  1. Design, implement, and manage monitoring, alerting, and infrastructure resources (compute, storage, networking) across 50+ global regions
  2. Lead incident management processes, including postmortems, root cause analyses, and driving actionable improvements
  3. Automate operational tasks and workflows, building tools and processes for capacity planning, graceful rollouts, and safe data access to maintain high reliability and reduce manual intervention
  4. Collaborate across teams to drive reliability, security, and compliance throughout the software lifecycle
  5. Optimize infrastructure costs through strategic capacity planning and build-versus-buy decisions while maintaining system performance, reliability, and functionality

Skills

Required

  • 5+ years of experience in Site Reliability Engineering or similar roles supporting production environments
  • Expertise in infrastructure as code (IaC) tools (Pulumi, Terraform, CloudFormation, etc.)
  • Deep familiarity with observability tools (Datadog, Sentry, etc.) and incident response tooling and practices (PagerDuty, incident.io, etc.)
  • Proficiency with cloud infrastructure platforms (Azure, GCP, AWS, etc.)
  • Strong programming skills (Python, Bash, Go, or similar languages)
  • Proven track record of diagnosing complex system problems and implementing durable solutions
  • Solid understanding of CI/CD, Kubernetes, containerization, networking, databases, and cloud security principles
  • Excellent problem-solving skills, meticulous attention to detail, and a commitment to operational excellence