Sr. Software Engineer: Agentic Evaluation

Apple · Big Tech · Cupertino, CA +1 · Machine Learning and AI

This role focuses on building and maintaining the infrastructure, tooling, and pipelines for evaluating Siri, Apple's AI assistant, at scale. The engineer will extend evaluation capabilities to new platforms, support new features, diagnose failures, and contribute to architecture decisions for evaluation systems. Experience evaluating ML, LLM, or agent-based systems is preferred.

What you'd actually do

  1. Extending evaluation capabilities to new devices, platforms, and runtime environments, with designs that favor portability over any single target
  2. Supporting the evaluation of new Siri features and interaction modalities, working from ambiguous early requirements toward concrete, automated coverage
  3. Diagnosing failures across the stack, from environment provisioning through pipeline execution to scoring, enabling auto-diagnostics and driving durable fixes
  4. Contributing to architecture decisions for the team's evaluation systems
  5. Partnering across engineering, infrastructure, and program teams to align on interfaces, priorities, and shared standards

Skills

Required

  • Strong programming skills in one or more compiled languages (Swift, C++ or Objective-C)
  • Python scripting skills for tooling and automation
  • Solid understanding of computer science fundamentals
  • Ability to quickly learn new technologies and adapt to evolving requirements
  • Excellent communication skills and ability to collaborate across teams
  • M.S. or B.S. in Computer Science, Machine Learning, or related field (or equivalent experience)

Nice to have

  • Experience staging, provisioning, or controlling test or evaluation environments to produce repeatable, deterministic conditions
  • Experience evaluating ML, LLM or agent-based systems, including familiarity with metrics, scoring methodology, or trajectory and outcome analysis
  • Experience designing or operating test infrastructure at scale, such as device provisioning, environment restore, warm pools, or continuous integration systems
  • Proficiency with Python and Swift in a production setting
  • A track record of approaching problems flexibly and cutting through ambiguity, adapting your approach to reach the right outcome and setting a clear path when requirements are not yet defined
  • A talent for focusing and simplifying, stripping away what is not essential and distilling complex decisions down to the factors that matter
  • A history of collaborating across teams and communicating effectively with both technical and program audiences

What the JD emphasized

  • evaluate Siri reliably and at scale
  • evaluating ML, LLM or agent-based systems
  • track record of approaching problems flexibly and cutting through ambiguity

Other signals

  • infrastructure, tooling, and pipelines that let us evaluate Siri reliably and at scale
  • diagnosing failures across the stack