Evaluation & Insights Machine Learning Engineer

Apple · Big Tech · Seattle, WA · Software and Services

This role centers on evaluating and improving AI systems: analyzing model outputs, developing evaluation frameworks, and translating findings into actionable improvements. It covers assessing model behavior, identifying edge cases, and ensuring systems are reliable, safe, and aligned with human expectations, along with building MLOps automation for evaluation pipelines and collaborating across teams to refine model performance.

What you'd actually do

  1. Lead Rigorous Model Evaluations: Architect and execute comprehensive evaluation suites for LLMs and multimodal models, identifying edge cases in multi-step reasoning, factuality, adversarial robustness, safety, and alignment.
  2. Advanced Scoring Frameworks: Develop deterministic, heuristic, and LLM-assisted evaluation frameworks (e.g., LLM-as-a-judge, reward modeling) to quantify human-perceived quality such as helpfulness and hallucination rate; a minimal judge sketch follows this list.
  3. Actionable Signal Extraction: Translate qualitative failure modes into quantifiable loss patterns, programmatic guardrails, and actionable data-mixture adjustments for model training and inference.
  4. Improve Performance: Partner with engineering teams to refine model behavior, leveraging evaluation telemetry to inform prompt engineering, Retrieval-Augmented Generation (RAG) strategies, and model fine-tuning.
  5. Latent Pattern Recognition: Apply advanced ML techniques (e.g., embedding-based clustering, representation learning, perturbation analysis) to systematically map error taxonomies and latent failure modes in model outputs; a clustering sketch also follows below.
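
For item 2, here is a minimal LLM-as-a-judge sketch in Python. The rubric, the 1-5 helpfulness scale, and the `call_model` client are illustrative assumptions, not a prescribed framework:

```python
# Minimal LLM-as-a-judge sketch. `call_model` is a hypothetical stand-in
# for whatever completion client is in use; rubric and scale are illustrative.
import json

JUDGE_PROMPT = """Rate the RESPONSE to the PROMPT for helpfulness on a 1-5 scale.
Return JSON only: {{"score": <int>, "rationale": "<one sentence>"}}

PROMPT: {prompt}
RESPONSE: {response}"""

def call_model(prompt: str) -> str:
    """Hypothetical completion client; swap in the real inference stack."""
    raise NotImplementedError

def judge(prompt: str, response: str) -> dict:
    """Score one (prompt, response) pair; fall back gracefully on bad JSON."""
    raw = call_model(JUDGE_PROMPT.format(prompt=prompt, response=response))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"score": None, "rationale": "unparseable judge output"}

def mean_helpfulness(pairs: list[tuple[str, str]]) -> float:
    """Aggregate judge scores over an eval set into one dataset-level metric."""
    scores = [v["score"] for p, r in pairs if (v := judge(p, r))["score"] is not None]
    return sum(scores) / max(len(scores), 1)
```

In practice a judge like this would be calibrated against human ratings before its scores are trusted for regression gating.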
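For item 5, a sketch of embedding-based failure clustering, assuming sentence-transformers and scikit-learn are available; the encoder checkpoint and cluster count are arbitrary illustrative choices:

```python
# Sketch of embedding-based failure clustering: embed failing outputs,
# group them, and let each cluster suggest one candidate error category.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_failures(failure_texts: list[str], n_clusters: int = 8) -> dict[int, list[str]]:
    """Group failing model outputs so each cluster suggests one error category."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint
    embeddings = encoder.encode(failure_texts, normalize_embeddings=True)
    labels = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(embeddings)
    taxonomy: dict[int, list[str]] = {}
    for label, text in zip(labels, failure_texts):
        taxonomy.setdefault(int(label), []).append(text)
    return taxonomy
```

Each resulting cluster can then be labeled by hand to seed an error taxonomy, which is typically how qualitative failure modes become tracked quantitative metrics.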

Skills

Required

  • Python
  • PyTorch
  • JAX
  • Hugging Face
  • scalable ML inference pipelines
  • model-evaluation workflows
  • structured rating frameworks
  • interpreting unstructured model outputs
  • synthesizing qualitative findings into actionable engineering guidance
  • fine-tuning LLMs
  • evaluating LLMs
  • evaluating multimodal models
  • evaluating NLP systems
  • AI quality metrics
  • hallucination detection techniques
  • model alignment (RLHF/DPO)
  • LLM-as-a-judge frameworks
  • building internal tools or automated pipelines for ML workflows
  • MLflow
  • Weights & Biases
  • advanced prompt engineering
  • RAG architectures
  • vector databases
  • semantic search

Nice to have

  • human factors
  • HCI
  • cognitive science methodologies
  • Ray
  • vLLM
  • embedding-based clustering
  • representation learning
  • perturbation analysis
  • SelfCheckGPT (self-consistency sketch after this list)
  • G-Eval
  • DeepEval
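
The SelfCheckGPT entry above refers to self-consistency hallucination detection: resample the model on the same prompt and flag claims the samples do not support. A rough sketch follows, using embedding similarity as a stand-in for the paper's scoring variants; `sample_model` and the threshold are hypothetical:

```python
# SelfCheckGPT-style self-consistency sketch: resample the model and flag
# answer sentences that disagree with the samples. Similarity scoring here
# is a simplification of the method's actual scoring variants.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

def sample_model(prompt: str, n: int) -> list[str]:
    """Hypothetical sampling client returning n stochastic completions."""
    raise NotImplementedError

def consistency_score(claim: str, samples: list[str]) -> float:
    """Mean cosine similarity between one claim and the resampled outputs."""
    claim_vec = encoder.encode(claim, convert_to_tensor=True)
    sample_vecs = encoder.encode(samples, convert_to_tensor=True)
    return float(util.cos_sim(claim_vec, sample_vecs).mean())

def flag_hallucinations(prompt: str, answer_sentences: list[str],
                        n_samples: int = 5, threshold: float = 0.5) -> list[str]:
    """Return sentences poorly supported by the model's own resamples."""
    samples = sample_model(prompt, n_samples)
    return [s for s in answer_sentences if consistency_score(s, samples) < threshold]
```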

What the JD emphasized

  • evaluate AI models
  • evaluation frameworks
  • model behavior analysis
  • assess, interpret, and improve the behavior of advanced AI models
  • evaluate and improve AI systems
  • evaluation suites
  • evaluation telemetry
  • evaluation pipelines
  • model evaluation infrastructure

Other signals

  • codify evaluation metrics, automate regression testing
  • define quantitative evaluation frameworks that capture nuanced human factors
  • build automated evaluation pipelines utilizing LLMs to assess outputs at scale
  • translate product requirements into scalable, reliable, and efficient model evaluation infrastructure