Siri, Eval Architect Engineer

Apple · Big Tech · Cupertino, CA · Machine Learning and AI

The role focuses on defining the architecture for systems that measure Siri's quality across platforms and model updates. It involves building evaluation infrastructure for large-scale automation, simulation, AI-powered auto-evaluators, and agentic fix pipelines. The Eval Systems Architect will own the technical vision and system architecture for Siri's evaluation stack, ensuring coherence, scalability, and trustworthiness, and will influence the technical roadmap for the evaluation platform.
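
For a concrete flavor of what "agentic fix pipelines" can mean, here is a minimal sketch of a closed loop: run an eval case, have a model diagnose the failure from the transcript, apply a candidate fix, and re-evaluate. Every name in it (EvalOutcome, run_case, propose_fix, apply_fix) is a hypothetical placeholder for illustration, not Apple's actual stack.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalOutcome:
    passed: bool
    score: float     # quality is continuous, not just pass/fail
    transcript: str  # what the assistant actually did on this case

# Hypothetical stand-ins for real infrastructure components.
RunCase = Callable[[str], EvalOutcome]  # run one eval case end to end
ProposeFix = Callable[[str], str]       # agent: failure transcript -> candidate patch
ApplyFix = Callable[[str], None]        # apply the candidate patch to the system under test

def fix_loop(case_id: str, run_case: RunCase, propose_fix: ProposeFix,
             apply_fix: ApplyFix, max_rounds: int = 3) -> EvalOutcome:
    """Closed loop: evaluate, diagnose, patch, re-evaluate, up to max_rounds."""
    outcome = run_case(case_id)
    for _ in range(max_rounds):
        if outcome.passed:
            break
        apply_fix(propose_fix(outcome.transcript))  # agentic repair step
        outcome = run_case(case_id)                 # verify the fix actually helped
    return outcome
```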

What you'd actually do

  1. Own the end-to-end technical vision and system architecture of Siri's evaluation infrastructure: a system spanning real-device automation, simulated product evaluation, AI-powered auto-evaluators, developer workflows, and observability tooling. Ensure we build toward a coherent, scalable, and trustworthy platform.
  2. Work across Agentic Eval Engineering and the broader Siri organization to ensure architectural coherence, define interfaces and contracts between systems (a hypothetical contract is sketched after this list), and drive the technical roadmap for the evaluation platform as a whole.
  3. Lead a first-principles review of existing evaluation tooling and infrastructure, identifying gaps, redundancies, and opportunities to simplify or unify.
  4. Represent the technical perspective in leadership discussions, influence build-vs-integrate decisions, and set the standards that enable teams to move fast without creating fragmentation.
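
One reading of "interfaces and contracts between systems" is a shared, versioned result schema that every evaluator emits and every dashboard or downstream pipeline consumes. A minimal sketch, with all field names assumed for illustration:

```python
from dataclasses import dataclass, asdict

SCHEMA_VERSION = 2  # bump whenever the contract changes shape

@dataclass(frozen=True)
class EvalResult:
    """Hypothetical contract between eval runners and downstream consumers."""
    case_id: str
    score: float                # normalized 0.0 to 1.0
    evaluator: str              # which auto-evaluator produced the score
    schema_version: int = SCHEMA_VERSION
    tags: tuple[str, ...] = ()  # optional field added in v2, with a safe default

def to_record(result: EvalResult) -> dict:
    # Serialization is part of the contract: consumers parse this dict, so any
    # new field must carry a default (schema evolution without breaking readers).
    return asdict(result)
```

Keeping defaults on every newly added field is what lets old producers and new consumers coexist during a migration; this is the "schema evolution" discipline the requirements below call out.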

Skills

Required

  • BS/MS/PhD in Computer Science, Software Engineering, or a related field.
  • 10+ years of software engineering experience, with at least 5 years in a systems architecture, staff/principal engineer, or technical leadership role.
  • Proven track record of designing and shipping large-scale distributed systems serving multiple teams or organizations.
  • Deep expertise in system design: API design, service architecture, data flow modeling, interface contracts, and schema evolution.
  • Solid software engineering fundamentals with production experience, including CI/CD, testing strategies, system monitoring, debugging complex multi-service systems, and code maintainability (a toy CI quality gate is sketched after this list).
  • Demonstrated expertise in using AI-assisted software development workflows to accelerate engineering while maintaining code quality.
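
As a toy illustration of a CI-style quality gate over eval scores (the helper name and thresholds are assumptions, not a known pipeline):

```python
import statistics

def quality_gate(baseline: list[float], candidate: list[float],
                 max_regression: float = 0.01) -> bool:
    """Block a rollout if the candidate's mean eval score drops more than
    max_regression below the baseline's mean (illustrative only)."""
    return statistics.mean(candidate) >= statistics.mean(baseline) - max_regression

# A small drop within tolerance passes; a large regression trips the gate.
assert quality_gate([0.92, 0.88, 0.90], [0.91, 0.89, 0.90])
assert not quality_gate([0.92, 0.88, 0.90], [0.60, 0.55, 0.58])
```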

Nice to have

  • Experience architecting evaluation, testing, or quality infrastructure at scale, particularly for AI/ML products where quality is continuous rather than binary.
  • Experience building LLM applications, LLM-as-judge evaluation frameworks (a minimal judge sketch follows this list), and offline evaluation pipelines.
  • Familiarity with MLOps principles for model lifecycle management and training data pipelines.
  • Experience with VM orchestration, fleet management, or large-scale job scheduling systems.
  • Knowledge of simulation and service virtualization techniques for complex software stacks.
  • Experience with observability platforms (metrics, logging, tracing, dashboarding) and defining SLOs for platform reliability.
  • Experience with agentic AI systems, including tool-use, multi-step reasoning, and human-in-the-loop workflows.
  • Track record of leading cross-team architectural initiatives (e.g., platform migrations, API unification, system consolidation) in organizations with 50+ engineers.
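
Since LLM-as-judge appears above, here is a minimal sketch of the pattern: send the response plus a rubric to a judge model and parse back a continuous score. The call_llm function, prompt wording, and JSON shape are all assumptions for illustration, not a real API.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for whatever judge model the stack actually uses."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading an assistant's response.
Rubric: correctness, helpfulness, safety.
Return JSON: {{"score": <float between 0 and 1>, "rationale": "<one sentence>"}}

User request: {request}
Assistant response: {response}"""

def judge(request: str, response: str) -> float:
    """LLM-as-judge: quality comes back as a continuous score, not pass/fail."""
    raw = call_llm(JUDGE_PROMPT.format(request=request, response=response))
    verdict = json.loads(raw)
    return float(verdict["score"])
```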

What the JD emphasized

  • architectural coherence
  • system-wide consistency
  • architectural decisions

Other signals

  • evaluation infrastructure
  • quality measurement
  • large-scale automation
  • model-in-the-loop simulation
  • AI-powered auto-evaluators
  • closed-loop agentic fix pipelines
  • end-to-end technical vision
  • system architecture
  • evaluation stack
  • scalable and trustworthy system