Machine Learning Engineer, ML/GenAI Evaluation

Apple · Big Tech · Austin, TX +3 · Software and Services

Machine Learning Engineer focused on evaluating ML and GenAI models for Wallet, Payments, and Commerce features. This role defines evaluation criteria, metrics frameworks, and quality standards, designs adversarial test strategies, and owns the model quality sign-off process to ensure models meet high standards for accuracy, robustness, fairness, and reliability before shipping to hundreds of millions of users. Responsibilities include building test sets, developing robustness testing methodologies, owning fairness evaluation end-to-end, evaluating generative model outputs, and synthesizing results for product decisions.

What you'd actually do

  1. Define evaluation criteria and quality metrics for ML models powering Wallet features
  2. Own fairness evaluation end-to-end: define fairness metrics appropriate to each Wallet feature, build bias test suites across protected attributes and user populations, measure disparate performance across subgroups, and gate model launches on fairness criteria with the same rigor as conventional metrics
  3. Evaluate generative and agentic model outputs — assessing hallucination rates, faithfulness, and groundedness using LLM-as-a-judge frameworks, human evaluation protocols, and prompt regression testing
  4. Own model quality sign-off — establish the launch criteria, run final evaluations, and make the call on model readiness before any feature ships
  5. Partner with ML engineers and Quality engineers to identify failure modes early in the development cycle and close the loop between evaluation findings and model improvements
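
To make item 2 concrete, here is a minimal sketch of what "measure disparate performance across subgroups and gate launches on fairness criteria" could look like in Python. It is an illustration only, not Apple's process: the function names, the choice of recall as the metric, and the 0.05 gap threshold are all assumptions made for the example.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """Recall (true positive rate) per subgroup.

    Subgroups with no positive labels are omitted, since recall is undefined there.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for label, pred, group in zip(y_true, y_pred, groups):
        if label == 1:
            positives[group] += 1
            if pred == 1:
                true_pos[group] += 1
    return {g: true_pos[g] / positives[g] for g in positives}


def passes_fairness_gate(y_true, y_pred, groups, max_gap=0.05):
    """Hypothetical launch gate: pass only if the largest recall gap
    between any two subgroups is at most max_gap (an assumed threshold)."""
    recalls = recall_by_group(y_true, y_pred, groups)
    if not recalls:
        raise ValueError("no subgroup has positive labels; recall gap is undefined")
    gap = max(recalls.values()) - min(recalls.values())
    return gap <= max_gap, gap, recalls


# Toy data: group "a" has all positives recovered, group "b" only one of three.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "b", "b", "b", "a", "b"]

ok, gap, recalls = passes_fairness_gate(y_true, y_pred, groups)
print(ok, round(gap, 3), recalls)
```

In practice the metric and threshold would be chosen per feature, and small subgroups would need confidence intervals before a gap is treated as real.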

Skills

Required

  • 5+ years of hands-on ML experience, with deep expertise in model evaluation, offline metrics design, and behavioral testing
  • Strong track record designing evaluation frameworks for production ML systems, covering precision-recall tradeoffs, calibration, fairness, and task-specific quality dimensions
  • Creative mindset with the ability to translate standard ML evaluation metrics (F1, AUC, etc.) into utility and user trust measures
  • Experience testing for distribution shift, out-of-distribution generalization, and temporal drift in real-world deployed models
  • Proven ability to construct adversarial test suites, aggressor scenarios, and edge-case corpora that surface model failure modes before they reach users
  • Strong programming skills in Python, and fluency with evaluation tooling, data pipelines, and experiment tracking (e.g., MLflow, W&B, or equivalent)
  • Excellent communication skills — ability to translate metric results into product-quality narratives for engineering and executive audiences
  • Experience owning model quality sign-off in a cross-functional launch process
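
As an illustration of the calibration requirement above, the sketch below computes expected calibration error (ECE) with equal-width confidence bins in plain Python. The function name, the bin count, and the toy inputs are assumptions for the example; the posting does not prescribe a specific calibration metric.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width bins over [0, 1].

    confidences: predicted probability of the chosen class, one per example.
    correct: 1 if the prediction was right, else 0, aligned with confidences.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one example")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Bin i covers [i / n_bins, (i + 1) / n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of examples.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy example: the model reports 0.9 confidence but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(round(ece, 3))
```

A lower value means reported confidence tracks observed accuracy more closely; the result depends on the bin count, so it is best compared across models using the same binning.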

Nice to have

  • M.S. in Machine Learning, Computer Science, Statistics, Applied Mathematics, or a related technical field strongly preferred
  • Experience with structured and semi-structured document understanding, OCR pipelines, or financial data extraction is a strong plus
  • PhD in Computer Science, Data Science, Statistics, AI/ML, or a related field
  • Experience with Bayesian or causal graph-based approaches to data generation
  • Experience with causal approaches to fairness evaluation — counterfactual fairness, causal Shapley values, or structural causal model–based bias auditing
  • Experience evaluating models under privacy constraints or on-device inference settings is a plus
  • Familiarity with confidence calibration techniques and uncertainty quantification is a plus
  • Background in financial services, fintech, or consumer payment products

What the JD emphasized

  • rigorous evaluation
  • holding models accountable to fairness standards
  • how you measure a model is just as important as how you train it
  • hold quality standards others find uncomfortably high
  • fairness
  • model quality sign-off
  • rigor

Other signals

  • Defining evaluation criteria, metrics frameworks, and quality standards for ML models
  • Designing adversarial test strategies and surfacing failure modes
  • Owning the sign-off process for model quality, accuracy, robustness, and reliability
  • Evaluating generative and agentic model outputs using LLM-as-a-judge frameworks
  • Ensuring fairness and bias testing across protected attributes and user populations