Machine Learning Engineer

Apple · Big Tech · London, United Kingdom · Machine Learning and AI

Machine Learning Engineer focused on Evaluation & Insights for the Human-Centered AI team. The role involves architecting evaluation frameworks, designing MLOps pipelines for model assessment, translating qualitative failure modes into programmatic guardrails and training signals for Foundation Models and generative AI systems, and collaborating with partner teams to ensure AI experiences are reliable, safe, and aligned with human expectations.

What you'd actually do

  1. Lead Rigorous Model Evaluations: Architect and execute comprehensive evaluation suites for LLMs and multimodal models, identifying edge cases in multi-step reasoning, factuality, adversarial robustness, safety, and alignment.
  2. Advanced Scoring Frameworks: Develop deterministic, heuristic, and LLM-assisted evaluation frameworks (e.g., LLM-as-a-judge, reward modeling) to quantify human-perceived quality metrics such as helpfulness and hallucination rates; a minimal judge-scoring sketch follows this list.
  3. Actionable Signal Extraction: Translate qualitative failure modes into quantifiable loss patterns, programmatic guardrails, and actionable data-mixture adjustments for model training and inference.
  4. Improve Performance: Partner with engineering teams to refine model behavior, leveraging evaluation telemetry to inform prompt engineering, Retrieval-Augmented Generation (RAG) strategies, and model fine-tuning.
  5. Latent Pattern Recognition: Apply advanced ML techniques (e.g., embedding-based clustering, representation learning, perturbation analysis) to systematically map error taxonomies and latent failure manifolds in model outputs; an embedding-clustering sketch also follows this list.
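
A minimal sketch of the LLM-as-a-judge scoring mentioned in item 2, assuming a hypothetical `call_judge` callable that wraps whatever judge-model endpoint is in use; the rubric dimensions and prompt are illustrative, not an internal framework.

```python
import json
import re
from statistics import mean
from typing import Callable

# Rubric dimensions the judge is asked to score (illustrative only).
RUBRIC = ["helpfulness", "factuality", "safety"]

JUDGE_PROMPT = """You are grading a model response against a user request.
Score each dimension from 1 (poor) to 5 (excellent) and reply with JSON only,
e.g. {{"helpfulness": 4, "factuality": 5, "safety": 5}}.

User request:
{prompt}

Model response:
{response}
"""

def judge_response(prompt: str, response: str, call_judge: Callable[[str], str]) -> dict:
    """Ask a judge model to score one response; `call_judge` is a hypothetical
    wrapper around whatever LLM endpoint serves as the judge."""
    raw = call_judge(JUDGE_PROMPT.format(prompt=prompt, response=response))
    # Be defensive: judges sometimes wrap the JSON in prose or code fences.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    try:
        scores = json.loads(match.group(0)) if match else {}
    except json.JSONDecodeError:
        scores = {}
    return {dim: int(scores.get(dim, 0)) for dim in RUBRIC}

def aggregate(per_example_scores: list[dict]) -> dict:
    """Roll per-example rubric scores up into suite-level metrics."""
    return {dim: mean(s[dim] for s in per_example_scores) for dim in RUBRIC}
```

For item 5, a sketch of embedding-based clustering over failing outputs, assuming the sentence-transformers and scikit-learn libraries; the checkpoint name and cluster count are arbitrary choices, and the resulting clusters would still need manual labelling to become an error taxonomy.

```python
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_failures(failure_texts: list[str], n_clusters: int = 8) -> dict[int, list[str]]:
    """Embed failing model outputs (or rater comments) and group them so that
    recurring failure modes surface as clusters for manual labelling."""
    # Any sentence-embedding model works; this checkpoint is just a common default.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(failure_texts, normalize_embeddings=True)

    labels = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(embeddings)

    clusters: dict[int, list[str]] = defaultdict(list)
    for text, label in zip(failure_texts, labels):
        clusters[int(label)].append(text)
    return clusters
```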

Skills

Required

  • Python
  • PyTorch
  • JAX
  • Hugging Face
  • interpreting unstructured model outputs
  • synthesizing qualitative findings into actionable engineering guidance and training objectives
  • developing, fine-tuning, and evaluating LLMs, multimodal models, and NLP systems
  • AI quality metrics
  • hallucination detection techniques
  • model alignment
  • RLHF
  • DPO (see the loss sketch after this list)
  • LLM-as-a-judge frameworks
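
The required skills call out alignment methods such as RLHF and DPO; as a concrete reference point, here is a minimal PyTorch sketch of the standard DPO objective, assuming per-sequence log-probabilities under the policy and a frozen reference model are computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each input has shape (batch,) and holds the summed log-probability of the
    chosen / rejected completion under the policy or the frozen reference model.
    """
    chosen_rewards = policy_chosen_logps - ref_chosen_logps        # implicit reward of preferred answer
    rejected_rewards = policy_rejected_logps - ref_rejected_logps  # implicit reward of dispreferred answer
    margin = beta * (chosen_rewards - rejected_rewards)
    # Maximize the probability that the chosen completion outranks the rejected one.
    return -F.logsigmoid(margin).mean()
```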

Nice to have

  • human factors
  • HCI
  • cognitive science methodologies
  • building scalable ML inference pipelines
  • building model-evaluation workflows
  • building structured rating frameworks
  • building internal tools for ML workflows
  • building automated pipelines for ML workflows
  • MLflow (see the tracking sketch after this list)
  • Weights & Biases
  • advanced prompt engineering
  • RAG architectures
  • vector databases
  • semantic search
  • fine-tuning
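
Several nice-to-have items concern experiment tracking for model-evaluation workflows. Below is a small sketch, assuming MLflow and an already-computed metrics dictionary, of how suite-level scores for a checkpoint might be recorded; the experiment name and parameters are invented for illustration.

```python
import mlflow

def log_eval_run(checkpoint: str, suite_name: str, metrics: dict[str, float]) -> None:
    """Record one evaluation run so model checkpoints can be compared over time.

    `metrics` maps metric names (e.g. "helpfulness", "hallucination_rate")
    to suite-level averages computed elsewhere.
    """
    mlflow.set_experiment("model-eval")  # experiment name is illustrative
    with mlflow.start_run(run_name=f"{suite_name}-{checkpoint}"):
        mlflow.log_params({"checkpoint": checkpoint, "suite": suite_name})
        mlflow.log_metrics(metrics)

# Example usage with made-up numbers:
# log_eval_run("ckpt-2024-06-01", "safety-v2", {"helpfulness": 4.1, "hallucination_rate": 0.03})
```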

What the JD emphasized

  • evaluating and optimizing Foundation Models, generative AI systems, LLMs, and multimodal models
  • evaluating NLP systems and AI quality metrics
  • evaluating model behavior, model outputs, and model checkpoints at scale
  • evaluating human-perceived quality metrics, human-centric assessments, and automated annotation
  • evaluating LLM-as-a-judge frameworks and structured rating frameworks
  • evaluating evaluation frameworks, evaluation suites, model-evaluation workflows, and model-evaluation infrastructure
  • evaluating ML workflows

Other signals

  • evaluating foundation models
  • architecting evaluation frameworks
  • MLOps pipelines for model assessment
  • translating qualitative failure modes into programmatic guardrails and training signals (see the guardrail sketch below)
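
As a rough illustration of the last signal, turning a qualitative failure mode into a programmatic guardrail and an aggregate eval/training signal could look like the sketch below; the leak pattern and function names are assumptions for illustration, not a documented internal check.

```python
import re
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    passed: bool
    reason: str

# Illustrative pattern distilled from a qualitative failure mode reported by raters:
# responses that echo hidden instructions back to the user.
SYSTEM_PROMPT_LEAK = re.compile(r"my (system|hidden) (prompt|instructions)", re.IGNORECASE)

def check_prompt_leak(response: str) -> GuardrailResult:
    """Deterministic guardrail that can run at inference time or as an eval check."""
    if SYSTEM_PROMPT_LEAK.search(response):
        return GuardrailResult(False, "response references hidden system instructions")
    return GuardrailResult(True, "ok")

def leak_rate(responses: list[str]) -> float:
    """Aggregate the guardrail into a signal: fraction of flagged outputs."""
    flagged = sum(not check_prompt_leak(r).passed for r in responses)
    return flagged / max(len(responses), 1)
```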