Staff Software Engineer, Core Infrastructure

Harvey · AI Frontier · New York, NY · Engineering

A Staff Software Engineer on the Core Infrastructure team designs, builds, and scales the infrastructure behind a global legal AI platform. The scope covers multi-cloud infrastructure (Azure, GCP), Kubernetes, networking, observability, and distributed systems, with an emphasis on reliability and efficiency. The role also involves optimizing inference request routing, rate limiting, and deployment strategies to support a rapidly growing user base and enterprise clients.

What you'd actually do

  1. Design and build scalable, fault-tolerant infrastructure systems that power Harvey's AI platform across multiple cloud regions
  2. Own and evolve our multi-cloud infrastructure (Azure, GCP), including Kubernetes orchestration, networking, and container management
  3. Lead technical initiatives around observability, incident response, and operational excellence — building systems that enable rapid detection and resolution of issues
  4. Architect and optimize our distributed systems for reliability, including load balancing, quota management, and failover mechanisms
  5. Partner with Product Engineering and Security teams to ensure our infrastructure is an accelerant, not a constraint

Skills

Required

  • Infrastructure Engineering
  • Platform Engineering
  • Distributed Systems
  • Cloud Infrastructure (Azure preferred)
  • Kubernetes
  • Container Orchestration
  • Networking
  • Cloud Security
  • Infrastructure as Code (Terraform, Pulumi, CloudFormation)
  • Observability Tools (Datadog, Sentry)
  • Incident Response Practices (PagerDuty)
  • Python
  • Go

Nice to have

  • AI/ML Workloads Infrastructure
  • High-throughput Inference Systems
  • Distributed Rate Limiting
  • Load Balancing
  • Quota Management Systems
  • Multi-tenant Platforms
  • Security and Compliance
  • Cross-functional Project Leadership

What the JD emphasized

  • 10+ years of experience in Infrastructure Engineering or Platform Engineering in a production environment
  • Long track record building and scaling complex, large-scale distributed systems
  • Deep proficiency with cloud infrastructure platforms (Azure preferred; GCP or AWS experience transfers well)
  • Strong fluency in Infrastructure as Code (IaC) tools — Terraform, Pulumi, or CloudFormation
  • Solid understanding of Kubernetes, container orchestration, networking, and cloud security at scale
  • Experience with observability tools (Datadog, Sentry) and incident response practices (PagerDuty, Incident.io)
  • Strong programming skills in Python, Go, or similar languages
  • Excellent problem-solving skills, a "spidey sense" of where things could go wrong, and a commitment to operational excellence