Senior Applied Scientist - AI Evaluation & Quality Systems

Apple · Big Tech · Seattle, WA · Machine Learning and AI

Senior Applied Scientist focused on building and scaling AI evaluation and quality systems. The role involves developing methodologies, tooling, and autonomous QA agents to ensure the trustworthiness and quality of AI/ML systems, with a strong emphasis on human-in-the-loop evaluation and anomaly detection. Requires a blend of research and engineering skills to prototype, validate, and ship solutions.

What you'd actually do

  1. Design and implement scalable ground truth generation pipelines across varied task types, annotation modalities, and cold start conditions
  2. Build and maintain calibration frameworks that keep LLM evaluators anchored to human judgment over time (a minimal calibration sketch follows this list)
  3. Develop anomaly detection systems that surface evaluator drift, distribution shifts, and coverage gaps across human annotation and automated evaluation pipelines (a drift-check sketch also follows)
  4. Design, build, and deploy autonomous QA agents targeting specific facets of evaluation quality, architected for generalizability and self-service adoption across teams
  5. Partner closely with cross-functional teams to ensure evaluation systems meet the highest standards of accuracy, consistency, and relevance
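
The calibration loop in item 2 is easiest to see in code. As a minimal sketch (not Apple's actual stack; the data, function names, and threshold are all hypothetical), one common pattern is to score a human-labeled audit set with the LLM judge on a schedule and track chance-corrected agreement, re-anchoring the judge when agreement falls:

```python
# Minimal calibration check: compare an LLM judge's verdicts against a
# human-labeled audit set and flag when agreement drops. All names, data,
# and thresholds are hypothetical illustrations, not Apple's implementation.
from sklearn.metrics import cohen_kappa_score

KAPPA_FLOOR = 0.6  # hypothetical re-anchoring threshold

def judge_is_calibrated(human_labels: list[int], judge_labels: list[int]) -> bool:
    """Return True if the LLM judge still tracks human judgment.

    Cohen's kappa corrects raw agreement for chance, which matters when
    verdict distributions are skewed (e.g., mostly "pass").
    """
    kappa = cohen_kappa_score(human_labels, judge_labels)
    return kappa >= KAPPA_FLOOR

# Example: a weekly audit batch of pass/fail verdicts (1 = pass, 0 = fail).
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
if not judge_is_calibrated(human, judge):
    print("Judge fell below agreement floor; re-anchor prompts and rubric.")
```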
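
For the drift detection in item 3, a simple hypothetical starting point is a two-sample test comparing the current window of evaluator scores against a trusted reference window; a production system would layer on coverage and shift checks, but the core signal looks like this:

```python
# Minimal evaluator-drift check: compare the current window of automated
# evaluator scores against a trusted reference window with a two-sample
# Kolmogorov-Smirnov test. Names, data, and alpha are hypothetical.
from scipy.stats import ks_2samp

ALPHA = 0.01  # hypothetical significance level for raising an alert

def scores_drifted(reference: list[float], current: list[float]) -> bool:
    """Return True if the two score distributions differ significantly."""
    result = ks_2samp(reference, current)
    return result.pvalue < ALPHA

# Example: evaluator scores in [0, 1] from last month vs. this week.
reference = [0.82, 0.77, 0.91, 0.68, 0.85, 0.79, 0.88, 0.74, 0.81, 0.90]
current = [0.55, 0.61, 0.48, 0.66, 0.52, 0.59, 0.63, 0.50, 0.57, 0.60]
if scores_drifted(reference, current):
    print("Score distribution shifted; audit the evaluator and recent inputs.")
```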

Skills

Required

  • Large Language Models
  • prompt engineering
  • evaluation methodology for generative AI
  • LLM-as-a-judge design
  • meta-evaluation
  • failure mode analysis
  • human-in-the-loop evaluation systems
  • data quality at scale
  • ground truth generation pipelines
  • Python
  • ML frameworks
  • production LLM pipelines
  • production LLM agents
  • anomaly detection systems
  • drift detection
  • distribution analysis
  • systematic bias identification

Nice to have

  • PhD in Computer Science, Machine Learning, Statistics, or a related field
  • agent architectures that are configurable and extensible
  • communication skills
  • ability to influence technical direction

What the JD emphasized

  • 5+ years of industry experience in applied science or machine learning with demonstrated impact on shipped systems
  • Strong working knowledge of evaluation methodology for generative AI
  • Proficiency in Python and relevant ML frameworks, with production experience building, deploying, and monitoring LLM-based pipelines and agents

Other signals

  • building systems and methodologies for AI evaluation
  • developing autonomous QA agents
  • ensuring the data powering AI/ML systems meets the highest standards
  • validating signals used to train and evaluate AI