AIML - Sr Data Scientist, Evaluation

Apple · Big Tech · Cupertino, CA +1 · Machine Learning and AI

This role focuses on developing and researching evaluation methods to improve the quality of user-facing AI products like Siri and Apple Intelligence. It involves working with large datasets, applying advanced analytical methods including prompt engineering and using LLMs as judges, and partnering with engineering teams to translate methodological developments into production technologies. The goal is to guide product development and decisions through rigorous evaluation and data analysis, ultimately impacting products used by hundreds of millions globally.

What you'd actually do

  1. Research and develop evaluation methods to improve the quality of Apple's user-facing products, such as Siri and Apple Intelligence.
  2. Work with evaluation/experimentation engineering teams to translate your methodological developments into technologies that product engineering will use every day.
  3. Work with large, complex data sets.
  4. Solve difficult, non-routine analysis problems, applying advanced analytical methods as needed, including prompt engineering and building LLMs as judges.
  5. Conduct analysis that includes data collection and quality control, requirements specification, processing and presentations.

Skills

Required

  • Advanced degree in a quantitative field such as Statistics, Operational Research, Bioinformatics, Economics, Psychology, Computer Science, Sociology, Mathematics, or Physics.
  • Proficiency in data science, machine learning, and analytics, including statistical data analysis.
  • Experience crafting, conducting, analyzing, and interpreting experiments and investigations, especially on data quality, evaluation and risk assessment.
  • Strong programming skills, including data-querying skills (SQL and/or Spark, etc.) and experience with a scripting language for data processing and development (e.g., Python, R, or Scala).
  • Experience articulating and translating business questions and using statistical techniques to arrive at an answer using available data.
  • Strong communication skills and the ability to naturally explain difficult technical topics (especially causal topics) to everyone from data scientists to engineers to business partners.

Nice to have

  • Proven ability to collaborate effectively across functions and work well within a team.
  • Capable of driving projects of varying sizes and scopes (some will take months, others weeks) and knowing when to dive deep.
  • Experience with methods that address accuracy and variability in human annotation data.
  • Ability to learn new technology and skills to accommodate changing working requirements.

What the JD emphasized

  • driving product impact via measurement and evaluation
  • improve search quality
  • evaluation methods
  • building LLMs as judges
  • user experience evaluations

Other signals

  • improve search quality and guide feature development with data