Systems Engineer - Evaluation Engineering

Apple · Big Tech · Cupertino, CA +1 · Machine Learning and AI

Systems Engineer focused on building and scaling the infrastructure for an AI Agentic Evaluation Platform. This involves designing distributed execution engines, internal developer platforms, backend APIs, stream processing, and deployment topologies for large-scale agent simulations and LLM-as-a-judge pipelines. The role emphasizes reliability, observability, and guardrails for complex AI systems.

What you'd actually do

  1. Architect and scale the core asynchronous engine responsible for orchestrating thousands of parallel agent simulations, validation tests, and LLM-as-a-judge pipelines.
  2. Design and build self-service infrastructure, CLI tools, and internal APIs that allow ML and product teams to easily integrate evaluation pipelines into their CI/CD workflows.
  3. Design, build, and maintain highly performant, type-safe APIs (gRPC/REST) capable of serving complex evaluation pipelines, trace data, and real-time generation metrics.
  4. Build robust data pipelines to ingest and transform high-volume execution traces. Ensure immutable data lineage so that every evaluation metric can be perfectly traced back to its raw generation for granular error attribution.
  5. Own the deployment topologies of the evaluation platform across multi-tenant clusters using declarative infrastructure and continuous delivery practices.

Skills

Required

  • MS in computer science or equivalent
  • 7+ years of experience as a distributed systems engineer, platform engineer, or equivalent
  • Python (asyncio) or Java
  • gRPC/Protobuf, GraphQL, or REST (FastAPI)
  • distributed systems engineering
  • platform engineering
  • API design
  • concurrency
  • enterprise scale

Nice to have

  • PostgreSQL
  • Kafka, AWS SQS/SNS, RabbitMQ, or Redis Streams
  • Kubernetes (orchestration, custom operators, service meshes like Istio or Linkerd)
  • AWS, GCP, or Azure
  • Agentic RAG platforms
  • developer-facing infrastructure tooling
  • Terraform
  • GitHub or ArgoCD

What the JD emphasized

  • core Siri Agentic Evaluation Platform
  • massive-scale distributed problem
  • high-throughput agentic simulations
  • orchestrate multi-model judging pipelines
  • billions of tokens
  • complex data types
  • thousands of parallel agent simulations
  • LLM-as-a-judge pipelines
  • immutable data lineage
  • perfectly traced back
  • granular error attribution
  • multi-tenant clusters
  • deep observability
  • distributed tracing
  • structured metrics
  • alerting
  • prevent downstream API rate-limiting
  • cascading cluster failures
  • 7+ years of experience as a distributed systems engineer, platform engineer, or equivalent
  • Strong proficiency in languages optimized for concurrency and enterprise scale
  • Deep expertise in designing robust, versioned production APIs
  • Experience building Agentic RAG platforms or developer-facing infrastructure tooling

Other signals

  • evaluating AI models
  • building AI infrastructure
  • agentic systems