Automation and Triage Engineer, Siri

Apple · Big Tech · Cupertino, CA +1 · Software and Services

This role focuses on building and maintaining automated test suites and evaluation frameworks for Siri to ensure AI quality and performance across Apple platforms. It involves investigating complex failures in Siri's AI pipeline, distinguishing true product regressions from infrastructure noise, and partnering with engineering and ML teams to define and track quality metrics. The role requires strong software engineering skills, experience with agentic coding systems and LLM evaluation, and familiarity with on-device AI and conversational systems.

What you'd actually do

  1. Design and maintain end-to-end automated test suites for Siri across iOS, macOS, iPadOS, CarPlay, and other Apple platforms.
  2. Author and scale evaluation scenarios that reflect real-world user intent and on-device context.
  3. Investigate and triage complex failures across Siri's AI stack — planner behavior, tool execution, search, context retrieval, and response generation.
  4. Distinguish true product regressions from infrastructure noise, and drive root cause analysis to clear, actionable outcomes.
  5. Partner with engineering, ML, and product experience teams to define quality metrics, track regressions, and validate improvements before they ship.

Skills

Required

  • Bachelor's Degree in Computer Science or related field.
  • 8+ years of experience in a software development or test engineering role, with demonstrated leadership in quality strategy and automation.
  • Strong software engineering fundamentals with hands-on experience in Python, Swift, or both, and a track record of building test automation frameworks, CI/CD pipelines, or evaluation infrastructure for complex software systems.
  • Experience with agentic coding systems, using AI-assisted development tools to accelerate implementation, prototype evaluation pipelines, and tackle complex engineering problems with speed and precision.
  • Familiarity with machine learning concepts and LLM-based systems, including evaluation methodologies, prompt design, and model behavior analysis.

Nice to have

  • Experience with on-device AI, natural language understanding, or conversational systems is a strong plus, as is familiarity with scenario-based testing or agent trajectory analysis.
  • Prior work on quality or reliability for consumer-facing AI products is especially valued.

What the JD emphasized

  • rigorous evaluation infrastructure
  • complex failures
  • AI pipeline
  • automated pipelines
  • user intent
  • on-device context
  • measurable signal
  • ML teams
  • agentic coding systems
  • AI-assisted development tools
  • machine learning concepts
  • LLM-based systems
  • evaluation methodologies
  • prompt design
  • model behavior analysis
  • on-device AI
  • natural language understanding
  • conversational systems
  • scenario-based testing
  • agent trajectory analysis
  • quality or reliability for consumer-facing AI products

Other signals

  • AI quality
  • intelligent systems
  • evaluation infrastructure