Machine Learning Engineer

Apple · Big Tech · London, United Kingdom +1 · Machine Learning and AI

Machine Learning Engineer focused on Evaluation & Insights for the Human-Centered AI team at Apple Media Services. The role involves evaluating and optimizing Foundation Models and generative AI systems, architecting evaluation frameworks, designing MLOps pipelines, and translating failure modes into guardrails and training signals. This position bridges human perception and algorithmic performance, working cross-functionally to ensure AI experiences are reliable, safe, and aligned with human expectations.

What you'd actually do

  1. Lead Rigorous Model Evaluations: Architect and execute comprehensive evaluation suites for LLMs and multimodal models, identifying edge cases in multi-step reasoning, factuality, adversarial robustness, safety, and alignment.
  2. Advanced Scoring Frameworks: Develop deterministic, heuristic, and LLM-assisted evaluation frameworks (e.g., LLM-as-a-judge, reward modeling) to quantify human-perceived quality metrics (e.g., helpfulness, hallucination rates).
  3. Actionable Signal Extraction: Translate qualitative failure modes into quantifiable loss patterns, programmatic guardrails, and actionable data-mixture adjustments for model training and inference.
  4. Improve Performance: Partner with engineering teams to refine model behavior, leveraging evaluation telemetry to inform prompt engineering, Retrieval-Augmented Generation (RAG) strategies, and model fine-tuning.
  5. Latent Pattern Recognition: Apply advanced ML techniques (e.g., embedding-based clustering, representation learning, perturbation analysis) to systematically map error taxonomies and latent failure manifolds in model outputs.
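
As a rough illustration of item 2, one common way to check an LLM-as-a-judge scorer is to measure how well its scores rank-correlate with human ratings on the same outputs. The sketch below is not taken from the posting: the helper names (average_ranks, spearman) and the sample numbers are hypothetical, and Spearman rank correlation is only one of several agreement measures a team might choose.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical, made-up data: human helpfulness ratings (1-5) and the
    # scores an automated judge assigned to the same five model outputs.
    human_ratings = [5, 4, 2, 1, 3]
    judge_scores = [0.9, 0.6, 0.7, 0.1, 0.4]
    print(f"judge vs. human Spearman: {spearman(human_ratings, judge_scores):.3f}")
```

A value near 1 would suggest the judge orders outputs much as the human raters do; a low or negative value would suggest the judge prompt or rubric needs rework before its scores are used as an evaluation signal.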

Skills

Required

  • Python
  • PyTorch
  • JAX
  • Hugging Face
  • LLMs
  • multimodal models
  • NLP systems
  • AI quality metrics
  • hallucination detection techniques
  • model alignment
  • RLHF/DPO
  • LLM-as-a-judge frameworks

Nice to have

  • human factors
  • HCI
  • cognitive science methodologies
  • ML inference pipelines
  • model-evaluation workflows
  • structured rating frameworks
  • MLflow
  • Weights & Biases
  • prompt engineering
  • RAG architectures
  • vector databases
  • semantic search
  • fine-tuning

What the JD emphasized

  • evaluate AI models
  • evaluate and optimize Foundation Models and generative AI systems
  • architect robust evaluation frameworks
  • design scalable MLOps pipelines for model assessment
  • translate qualitative failure modes into programmatic guardrails and training signals
  • assess, interpret, and improve the behavior of advanced AI models
  • ensure that our AI experiences are reliable, safe, and aligned with human expectations
  • Lead Rigorous Model Evaluations
  • Architect and execute comprehensive evaluation suites
  • identify edge cases
  • Advanced Scoring Frameworks
  • Develop deterministic, heuristic, and LLM-assisted evaluation frameworks
  • quantify human-perceived quality metrics
  • Actionable Signal Extraction
  • Translate qualitative failure modes into quantifiable loss patterns, programmatic guardrails, and actionable data-mixture adjustments
  • Improve Performance
  • refine model behavior
  • Latent Pattern Recognition
  • systematically map error taxonomies and latent failure manifolds
  • MLOps & Automation
  • codify evaluation metrics
  • automate regression testing
  • integrate human-centric assessments into ML CI/CD pipelines
  • Distributed Evaluation Pipelines
  • Architect scalable, distributed inference and processing pipelines
  • high-throughput model evaluation
  • automated annotation
  • output analysis at scale
  • Human-Centric Metrics
  • Define quantitative evaluation frameworks that capture nuanced human factors
  • trust calibration
  • conversational state tracking
  • interpretability
  • Auto-Evaluator Systems
  • Build automated evaluation pipelines utilizing LLMs to assess outputs at scale
  • optimizing for high correlation with human baseline annotations
  • Cross-Functional Partnership
  • translate product requirements into scalable, reliable, and efficient model evaluation infrastructure
  • Advanced proficiency in Python and modern deep learning ecosystems
  • Strong ability to interpret unstructured model outputs
  • synthesize qualitative findings into actionable engineering guidance and training objectives
  • Hands-on experience developing, fine-tuning, or evaluating LLMs, multimodal models, and NLP systems
  • Deep familiarity with AI quality metrics
  • hallucination detection techniques
  • model alignment
  • LLM-as-a-judge frameworks
  • Proven experience building scalable ML inference pipelines
  • model-evaluation workflows
  • structured rating frameworks for large-scale AI systems
  • Experience building internal tools or automated pipelines for ML workflows

Other signals

  • evaluating Foundation Models and generative AI systems
  • architect robust evaluation frameworks
  • design scalable MLOps pipelines for model assessment
  • translate qualitative failure modes into programmatic guardrails and training signals