Designing benchmarks and automated scoring systems to measure model quality, safety, or capability — typically blending classical metrics, LLM-as-judge, and human review.
Primary AI lifecycle stage: evaluation.
As of today, 2,040 active AI roles across 208 companies in our index reference Evals. Hiring concentrates at the agents (57%) and evaluation (12%) stages. Most common sectors: Big Tech, Enterprise, AI Frontier.
The companies with the most active Evals listings are: Amazon (188 roles), Google (153 roles), OpenAI (95 roles), Microsoft (73 roles), JPMorgan Chase (70 roles).
198 AI roles tagged evals.
| Company | Title | Sector | AI score | Other tags |
|---|---|---|---|---|
| Airbnb | Senior Machine Learning Engineer, Customer Support Engineering | Consumer | 9 | Agent orchestration · Tool use · Guardrails · RAG · Fine-tuning · Model serving · RLHF · Agent research |
|  | Senior Machine Learning Engineer, GenAI Security | Consumer | 9 | Agent orchestration · Tool use · Guardrails · Fine-tuning · Model serving |
| Zillow | Senior Machine Learning Engineer | Consumer | 9 | Agent orchestration · Multimodal · Guardrails · LLM observability · Model serving |
| DoorDash | AI Research Fellowship, (Summer and Fall 2026) | Consumer | 9 | Agent orchestration · Tool use · Forecasting · Multimodal · Vision · Audio & speech · Frontier research · Synthetic data |
| Airbnb | Machine Learning Engineer, Customer Support Engineering | Consumer | 9 | Agent orchestration · Tool use · Guardrails · RAG · Fine-tuning · Model serving · RL post-training · Agent research |
|  | Principal Engineer, Agentic Engineering | Consumer | 9 | Agent orchestration · Agent research · Guardrails · LLM observability · Tool use |
|  | Sr. Data Scientist, Responsible AI | Consumer | 9 | Guardrails · LLM observability · Agent research · Multimodal |
| Zillow | Principal Machine Learning Engineer, Agentic AI | Consumer | 9 | Agent orchestration · Multimodal · Guardrails · LLM observability · Model serving · Agent research |
| Uber | 2026 PhD Research Intern, India | Consumer | 9 | Fine-tuning · RLHF · Agent research · Frontier research |
| Zillow | Principal Applied Scientist, Agentic AI | Consumer | 9 | RL post-training · RLHF · Reward modeling · Fine-tuning · Guardrails · Agent orchestration · Multimodal · Vector DB |
| Uber | Senior Research Scientist, Generative AI | Consumer | 9 | RL post-training · Fine-tuning · Frontier research · Vision |
| Zillow | Senior Applied Scientist, Agentic AI | Consumer | 9 | Agent orchestration · Tool use · Fine-tuning · LLM observability · Agent research |
|  | Machine Learning Engineer II, Computer Vision Applied Science | Consumer | 9 | Vision · Multimodal · Fine-tuning · RLHF · Model serving |
| Roblox | Principal Machine Learning Engineer, Engineering Acceleration | Consumer | 9 | Agent orchestration · Agent research · Synthetic data · Fine-tuning · Model serving · Code gen |
| Uber | Senior Staff Machine Learning Engineer – Moonshot AI | Consumer | 9 | Multimodal · Vision · Audio & speech · LLM observability · Fine-tuning · RAG · Model serving · Recommender systems |
|  | Staff Research Engineer, Post-training & Evaluation | Consumer | 9 | Fine-tuning · LLM observability · Frontier research · RL post-training |
| Uber | Principal Machine Learning Engineer - AV Labs | Consumer | 9 | Multimodal · Model serving |
| Uber | Senior Applied Scientist – AI Red Teaming & Model Risk | Consumer | 9 | Guardrails · Agent orchestration · Tool use · LLM observability · Agent research |
| Zillow | Distinguished Scientist | Consumer | 9 | Agent orchestration · Agent research · Multi-agent · Fine-tuning · RL post-training · LLM observability · Multimodal |
| Uber | Staff ML Engineer, Generative AI | Consumer | 9 | Agent orchestration · Tool use · Guardrails · LLM observability · RAG · Fine-tuning · Model serving · Multimodal · Audio & speech |
| Zillow | AI Applied Scientist - PhD Intern, Generative Computer Vision | Consumer | 9 | Vision · Multimodal · Fine-tuning |
| Zillow | AI Applied Scientist - PhD Intern, Foundational IQ | Consumer | 9 | Fine-tuning · Multimodal · Agent orchestration |
| Zillow | AI Applied Scientist - PhD Intern, 3D Computer Vision | Consumer | 9 | Vision · Multimodal · Fine-tuning |
| Airbnb | Senior Staff Machine Learning Engineer, Data & Eval | Consumer | 9 | LLM observability · Guardrails · RAG · Agent orchestration · Tool use · Fine-tuning · Synthetic data |
| Instacart | Machine Learning Engineer, PhD Intern | Consumer | 9 | LLM observability · RAG · Fine-tuning · Inference infra · Model serving · Recommender systems · Search & ranking · Agent research |
| DoorDash | Software Engineer, Machine Learning Infrastructure - Gen AI | Consumer | 8 | Agent orchestration · Tool use · Guardrails · LLM observability · RAG · Vector DB · Fine-tuning · Inference infra · Model serving |
| Chegg | Senior Software Engineer - Agentic AI Applications | Consumer | 8 | Agent orchestration · Tool use · Guardrails · LLM observability · RAG · Vector DB · Fine-tuning · Model serving · Multimodal |
| Roblox | Senior Data Scientist - Generative AI | Consumer | 8 | LLM observability · Fine-tuning · Agent orchestration · Multimodal · Code gen |
| Duolingo | Senior AI Engineering Manager | Consumer | 8 | Recommender systems · Fine-tuning · Model serving · LLM observability |
| Duolingo | Senior AI Engineering Manager | Consumer | 8 | Recommender systems · Fine-tuning · Model serving · LLM observability |