Evaluation Reliability SRE

Apple · Big Tech · Cupertino, CA +1 · Machine Learning and AI

This role focuses on the reliability and operational excellence of ML evaluation infrastructure, specifically the production backbone for Siri's quality signal. It covers resource management, orchestration, on-call response, and observability, all aimed at keeping the evaluation infrastructure trustworthy. The role requires hands-on experience in site reliability, infrastructure engineering, and operating production systems, with a focus on both proactive reliability work and incident response.

What you'd actually do

  1. Own reliability outcomes across the evaluation infrastructure stack: orchestration, capacity, and service health
  2. Own runbook quality across the team: author runbooks for complex failure categories and set a standard that other engineers can follow to produce runbooks of the same quality
  3. Build deep expertise in the device orchestration and provisioning layers — understand quota management, retry behavior, and failure modes well enough to diagnose upstream issues independently
  4. Instrument infrastructure components that lack observability; if a failure is hard to detect, make it easy to detect before the next occurrence
  5. Balance incident response with proactive reliability work — automation and eliminating recurring failures are core deliverables

Skills

Required

  • 5+ years of site reliability, infrastructure, or platform engineering experience
  • Direct on-call ownership in production systems
  • Hands-on orchestration experience (Kubernetes or equivalent)
  • Cluster health, resource management, scheduling, and failure diagnosis at scale

Nice to have

  • Experience owning or closely operating a device or VM provisioning pipeline
  • Familiarity with virtualization-layer failure modes
  • Track record of improving system reliability against measurable outcomes
  • Incident command discipline
  • Depth in at least one of: distributed systems reliability, device management infrastructure, evaluation or ML platform operations
  • Demonstrated cross-team technical influence

What the JD emphasized

  • production systems
  • on-call ownership
  • incident investigations end-to-end
  • agentic coding tools
  • system reliability against measurable outcomes

Other signals

  • ML evaluation infrastructure
  • operational excellence
  • reliability engineering
  • production backbone