Machine Learning Engineer

Apple · Big Tech · London, United Kingdom +1 · Machine Learning and AI

Machine Learning Engineer focused on Evaluation & Insights for the Human-Centered AI team at Apple Media Services. The role involves evaluating and optimizing Foundation Models and generative AI systems, architecting evaluation frameworks, designing MLOps pipelines, and translating failure modes into guardrails and training signals. This position bridges human perception and algorithmic performance, working cross-functionally to ensure AI experiences are reliable, safe, and aligned with human expectations.

What you'd actually do

Lead Rigorous Model Evaluations: Architect and execute comprehensive evaluation suites for LLMs and multimodal models, identifying edge cases in multi-step reasoning, factuality, adversarial robustness, safety, and alignment.
Advanced Scoring Frameworks: Develop deterministic, heuristic, and LLM-assisted evaluation frameworks (e.g., LLM-as-a-judge, reward modeling) to quantify human-perceived quality metrics (e.g., helpfulness, hallucination rates).
Actionable Signal Extraction: Translate qualitative failure modes into quantifiable loss patterns, programmatic guardrails, and actionable data-mixture adjustments for model training and inference.
Improve Performance: Partner with engineering teams to refine model behavior, leveraging evaluation telemetry to inform prompt engineering, Retrieval-Augmented Generation (RAG) strategies, and model fine-tuning.
Latent Pattern Recognition: Apply advanced ML techniques (e.g., embedding-based clustering, representation learning, perturbation analysis) to systematically map error taxonomies and latent failure manifolds in model outputs.

Skills

Required

Python
PyTorch
JAX
Hugging Face
LLMs
multimodal models
NLP systems
AI quality metrics
hallucination detection techniques
model alignment
RLHF/DPO
LLM-as-a-judge frameworks

Nice to have

human factors
HCI
cognitive science methodologies
ML inference pipelines
model-evaluation workflows
structured rating frameworks
MLflow
Weights & Biases
prompt engineering
RAG architectures
vector databases
semantic search
fine-tuning

What the JD emphasized

evaluate AI models
evaluate and optimize Foundation Models and generative AI systems
architect robust evaluation frameworks
design scalable MLOps pipelines for model assessment
translate qualitative failure modes into programmatic guardrails and training signals
assess, interpret, and improve the behavior of advanced AI models
ensure that our AI experiences are reliable, safe, and aligned with human expectations
Lead Rigorous Model Evaluations
Architect and execute comprehensive evaluation suites
identify edge cases
Advanced Scoring Frameworks
Develop deterministic, heuristic, and LLM-assisted evaluation frameworks
quantify human-perceived quality metrics
Actionable Signal Extraction
Translate qualitative failure modes into quantifiable loss patterns, programmatic guardrails, and actionable data-mixture adjustments
Improve Performance
refine model behavior
Latent Pattern Recognition
systematically map error taxonomies and latent failure manifolds
MLOps & Automation
codify evaluation metrics
automate regression testing
integrate human-centric assessments into ML CI/CD pipelines
Distributed Evaluation Pipelines
Architect scalable, distributed inference and processing pipelines
high-throughput model evaluation
automated annotation
output analysis at scale
Human-Centric Metrics
Define quantitative evaluation frameworks that capture nuanced human factors
trust calibration
conversational state tracking
interpretability
Auto-Evaluator Systems
Build automated evaluation pipelines utilizing LLMs to assess outputs at scale
optimizing for high correlation with human baseline annotations
Cross-Functional Partnership
translate product requirements into scalable, reliable, and efficient model evaluation infrastructure
Advanced proficiency in Python and modern deep learning ecosystems
Strong ability to interpret unstructured model outputs
synthesize qualitative findings into actionable engineering guidance and training objectives
Hands-on experience developing, fine-tuning, or evaluating LLMs, multimodal models, and NLP systems
Deep familiarity with AI quality metrics
hallucination detection techniques
model alignment
LLM-as-a-judge frameworks
Proven experience building scalable ML inference pipelines
model-evaluation workflows
structured rating frameworks for large-scale AI systems
Experience building internal tools or automated pipelines for ML workflows

Imagine what you could do here. At Apple, great new ideas have a way of becoming extraordinary products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish! Are you passionate about music, movies, and the world of Artificial Intelligence and Machine Learning? So are we! Join our Human-Centered AI team for Apple Media Services. In this role, you'll represent the user perspective on new features, review and analyze data, and evaluate AI models powering everything from search and recommendations to other innovative features. Collaborate with Data Scientists, Researchers, and Engineers to drive improvements across our platforms.

Description

We are looking for a Machine Learning Engineer focused on Evaluation & Insights for the Human-Centered AI team. In this role, you will bridge the gap between human perception and algorithmic performance, helping evaluate and optimize Foundation Models and generative AI systems. You will architect robust evaluation frameworks, design scalable MLOps pipelines for model assessment, and translate qualitative failure modes into programmatic guardrails and training signals (e.g., SFT, RLHF/DPO). This role blends deep ML engineering expertise with strong analytical judgment to assess, interpret, and improve the behavior of advanced AI models. You will work cross-functionally with Software Engineering, Product, Research and Responsible AI teams at Apple to ensure that our AI experiences are reliable, safe, and aligned with human expectations.

Responsibilities

Lead Rigorous Model Evaluations: Architect and execute comprehensive evaluation suites for LLMs and multimodal models, identifying edge cases in multi-step reasoning, factuality, adversarial robustness, safety, and alignment.
Advanced Scoring Frameworks: Develop deterministic, heuristic, and LLM-assisted evaluation frameworks (e.g., LLM-as-a-judge, reward modeling) to quantify human-perceived quality metrics (e.g., helpfulness, hallucination rates).
Actionable Signal Extraction: Translate qualitative failure modes into quantifiable loss patterns, programmatic guardrails, and actionable data-mixture adjustments for model training and inference.
Improve Performance: Partner with engineering teams to refine model behavior, leveraging evaluation telemetry to inform prompt engineering, Retrieval-Augmented Generation (RAG) strategies, and model fine-tuning.
Latent Pattern Recognition: Apply advanced ML techniques (e.g., embedding-based clustering, representation learning, perturbation analysis) to systematically map error taxonomies and latent failure manifolds in model outputs.
MLOps & Automation: Develop robust MLOps workflows to codify evaluation metrics, automate regression testing across model checkpoints, and integrate human-centric assessments into ML CI/CD pipelines.
Distributed Evaluation Pipelines: Architect scalable, distributed inference and processing pipelines (e.g., Ray, vLLM) for high-throughput model evaluation, automated annotation, and output analysis at scale.
Human-Centric Metrics: Define quantitative evaluation frameworks that capture nuanced human factors, including trust calibration, conversational state tracking, and interpretability.
Auto-Evaluator Systems: Build automated evaluation pipelines utilizing LLMs to assess outputs at scale, optimizing for high correlation with human baseline annotations.
Cross-Functional Partnership: Collaborate with ML researchers, software developers, and product managers across Apple to translate product requirements into scalable, reliable, and efficient model evaluation infrastructure.

Minimum Qualifications

Bachelor’s or Master’s degree in Computer Science, Machine Learning, Artificial Intelligence, Cognitive Science, or a related technical field, with relevant industry experience in ML Engineering or Applied Research.
Advanced proficiency in Python and modern deep learning ecosystems (PyTorch, JAX, Hugging Face).
Strong ability to interpret unstructured model outputs (text, transcripts, embedding spaces) and synthesize qualitative findings into actionable engineering guidance and training objectives.
Hands-on experience developing, fine-tuning, or evaluating LLMs, multimodal models, and NLP systems.
Deep familiarity with AI quality metrics, hallucination detection techniques (e.g., SelfCheckGPT), model alignment (RLHF/DPO), and LLM-as-a-judge frameworks (e.g., G-Eval, DeepEval).

Preferred Qualifications

Knowledge of human factors, HCI, or cognitive science methodologies as applied to AI system design.
Proven experience building scalable ML inference pipelines, model-evaluation workflows, and structured rating frameworks for large-scale AI systems.
Experience building internal tools or automated pipelines for ML workflows using tools like MLflow, Weights & Biases, or similar platforms.
Strong familiarity with advanced prompt engineering, RAG architectures (vector databases, semantic search), and fine-tuning.

At Apple, we're not all the same. And that's our greatest strength. We draw on the differences in who we are, what we've experienced and how we think. Because to create products that serve everyone, we believe in including everyone. Therefore, we are committed to treating all applicants fairly and equally. As a registered Disability Confident employer, we will work with applicants to make any reasonable accommodations. Apple will consider for employment all qualified applicants with criminal backgrounds in a manner consistent with applicable law.

At Apple, we believe accessibility is a fundamental human right. You’ll find that idea reflected in everything here — in our culture, our benefits and our digital tools. By welcoming as many perspectives as possible, we help you build a career where you feel like you belong.