ML Engineer - Automated Evaluation and Adversarial Design

Apple · Big Tech · Culver City +2 · Software and Services

ML Engineer focused on building and scaling automated evaluation systems and designing adversarial/stress-testing methodologies for AI-powered features in productivity and creative applications. The role involves assessing AI quality, particularly for multi-turn agentic experiences, and influencing model development decisions through rigorous evaluation.

What you'd actually do

  1. Define and own the automated evaluation approach for AI features, translating qualitative notions of quality into measurable, reproducible assessments across both single-turn and multi-turn agentic experiences
  2. Build adversarial test suites that target known and emerging model failure modes, including edge cases relevant to productivity application workflows and conversation-level failures such as context loss, instruction forgetting, and cascading errors across multi-step tasks (see the sketch after this list)
  3. Develop and execute stress test protocols that validate minimum performance thresholds under atypical input conditions, including extended conversation lengths, adversarial mid-conversation topic shifts, and complex tool-use sequences
  4. Ensure alignment between automated and human evaluation methods on an ongoing basis, identifying and resolving systematic disagreements
  5. Collaborate with engineering partners to integrate evaluation into development and release workflows
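
To make the conversation-level testing concrete, here is a minimal sketch of one adversarial case: verifying that an agent still honors an instruction given early in the conversation after a mid-conversation topic shift. The `run_agent` callable, the test-case format, and the pass criterion are all hypothetical stand-ins, not a prescribed framework.

```python
"""Illustrative sketch: a conversation-level adversarial check for
instruction retention across an adversarial topic shift. `run_agent`
is a stand-in for whatever model or agent endpoint is under test."""

from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": ...}


def instruction_retention_case(run_agent: Callable[[List[Message]], str]) -> dict:
    """Give a formatting instruction in turn 1, distract with an unrelated
    topic, then check the instruction is still followed in a later turn."""
    history: List[Message] = []

    def turn(user_text: str) -> str:
        history.append({"role": "user", "content": user_text})
        reply = run_agent(history)
        history.append({"role": "assistant", "content": reply})
        return reply

    turn("For the rest of this conversation, answer in exactly one sentence.")
    turn("Unrelated question: what is a good name for a spreadsheet tab?")  # topic shift
    final = turn("Summarize the plot of a novel about a sea voyage.")

    # Crude single-sentence heuristic; a real suite would use a stronger judge.
    sentence_count = sum(final.count(p) for p in (". ", "! ", "? ")) + 1
    return {
        "case": "instruction_retention_after_topic_shift",
        "passed": sentence_count <= 1,
        "final_response": final,
    }


if __name__ == "__main__":
    # Stub agent so the sketch runs end to end; replace with the real system under test.
    def stub_agent(messages: List[Message]) -> str:
        return "A crew sails into a storm and returns home changed."

    print(instruction_retention_case(stub_agent))
```

In practice, cases like this would be parameterized (instruction type, distractor length, number of intervening turns) and scored in aggregate rather than pass/fail one at a time.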

Skills

Required

  • Bachelor's degree in Computer Science, Machine Learning, Statistics, or a related field
  • 4+ years of experience building or significantly extending ML evaluation systems, including designing evaluation benchmarks or quality assessment frameworks that cover sequential or multi-step AI outputs
  • Experience independently defining evaluation architecture and methodology for AI or ML systems, including evaluation approaches where the unit of analysis is a conversation or session rather than a single output
  • Experience designing adversarial or red-teaming test methodologies for ML models or AI-powered features, including adversarial scenarios that target failures across multi-turn interactions
  • Experience with Python and ML frameworks (PyTorch, TensorFlow, or equivalent) in production or near-production settings
  • Track record of owning technical direction for evaluation efforts across multiple features or product areas

Nice to have

  • Experience evaluating user-facing AI features in consumer applications, with an understanding of how technical metrics connect to user-perceived quality
  • Familiarity with productivity software or creative tools, with the ability to assess output quality from a user workflow perspective
  • Experience ensuring alignment between automated and human evaluation methods, including inter-annotator agreement analysis and bias detection (see the sketch after this list)
  • Track record of designing evaluation systems that scale across multiple features or product areas without requiring bespoke solutions for each
  • Experience evaluating different types of AI systems, including API-based and custom-trained models
  • Demonstrated ability to communicate evaluation findings and readiness assessments to cross-functional partners
  • Experience leveraging automation to scale evaluation data generation and analysis
  • Experience building evaluation pipelines for conversational AI, dialogue systems, or agentic workflows, including turn-level and session-level automated scoring
  • Familiarity with agent orchestration frameworks (LangChain, LangGraph, CrewAI, AutoGen) and observability tooling (LangSmith, Braintrust, Arize), with an understanding of how to instrument and evaluate multi-step agent runs
  • Experience designing adversarial tests for tool-use reliability, function-calling accuracy, or agent planning quality
  • Graduate degree in a relevant field
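
As one concrete example of the automated-vs-human alignment work mentioned above, a chance-corrected agreement check (Cohen's kappa) over matched session labels might look like the following. The label names and toy data are hypothetical and only illustrate the shape of the analysis.

```python
"""Illustrative sketch: checking how well automated scores track human labels
on the same set of sessions, using Cohen's kappa."""

from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical pass/fail judgments on the same 10 sessions.
    automated = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
    human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
    print(f"kappa = {cohens_kappa(automated, human):.2f}")
    # Low kappa flags systematic disagreement worth auditing, e.g. broken down by failure mode.
```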

What the JD emphasized

  • automated evaluation systems
  • adversarial
  • stress-testing
  • multi-turn
  • agentic experiences
  • conversation flows
  • agent decision chains
  • model failure modes
  • conversation-level failures
  • multi-step tasks
  • tool-use reliability
  • function-calling accuracy
  • agent planning quality

Other signals

  • building and scaling automated evaluation systems
  • designing adversarial and stress-testing methodologies
  • evaluating AI features across a suite of productivity and creative applications
  • stress-testing entire conversation flows and agent decision chains
  • defining and owning the automated evaluation approach for AI features