Currently tracking 82 active AI roles, up 61% versus the prior 4 weeks. Primary focus: Agent · Engineering. Salary range $139k–$393k (avg $256k).
Data AI · Data labeling
| Title | Stage | AI score |
|---|---|---|
| **Research Scientist, Frontier Risk Evaluations** Designs and builds evaluation measures, harnesses, and datasets for frontier AI systems, with a focus on identifying and mitigating risks. Involves collaboration with external agencies and publishing findings, bridging AI research and policy. | Eval Gate · Agent | 9 |
| **Research Scientist, AI Controls and Monitoring** Designs methods, systems, and experiments for AI controls and monitoring, ensuring advanced AI models and agents remain aligned with intended goals even in high-stakes or adversarial environments. Includes developing monitoring techniques, researching layered control mechanisms, designing red-team simulations, and collaborating with policymakers. | | 9 |
| **Staff Machine Learning Research Scientist, LLM Evals** Scale AI is seeking a Staff Machine Learning Research Scientist to lead the development of novel evaluation methodologies, metrics, and benchmarks for large language models (LLMs). The role focuses on defining and measuring the capabilities and limitations of frontier LLMs, driving research that informs internal roadmaps and the broader community. Responsibilities include researching existing evaluation techniques, designing new benchmarks, implementing scalable evaluation pipelines, publishing findings, and mentoring junior researchers. The ideal candidate has 5+ years of experience in LLMs/NLP, a strong publication record, and experience leading research teams. | Eval Gate · Post-train | 9 |
| **Tech Lead/Manager, Machine Learning Research Scientist - LLM Evals** Scale AI is seeking a Tech Lead/Manager for its LLM Evals Research team. The role involves leading a team that develops and implements novel evaluation methodologies, metrics, and benchmarks for large language models, focusing on areas like instruction following, factuality, robustness, and fairness. It requires research into LLM evaluation techniques, communication with clients and internal teams, implementation of scalable evaluation pipelines, and publishing research findings. The ideal candidate has extensive experience in LLMs, NLP, and Transformer modeling, with a proven track record of research impact and team leadership. | Eval Gate · Post-train | 9 |
| **SWE Fellow - Human Frontier Collective (Canada)** Focuses on evaluating and interpreting advanced generative AI systems, designing datasets for rigorous evaluation, and contributing to research publications. Involves collaborating with AI labs and platforms to enhance model accuracy and reasoning. | Eval Gate | 8 |
| **SWE Fellow - Human Frontier Collective (UK)** A Software Engineer Fellow focused on evaluating and interpreting advanced generative AI systems within the Human Frontier Collective program. The fellow will design datasets, evaluate AI models, and contribute to research publications, aiming to enhance AI accuracy and reasoning. | Eval Gate | 8 |
| **SWE Fellow - Human Frontier Collective (US)** Focuses on evaluating advanced generative AI systems by designing domain-specific problems and datasets, providing expert insights to enhance model performance, and contributing to research publications. A research-oriented role within a fellowship program aimed at shaping the future of AI. | Eval Gate | 8 |
| **Senior Machine Learning Engineer - Model Evaluations, Public Sector** Focuses on building and scaling automated evaluation pipelines for AI systems, including LLMs and agentic models, to ensure their reliability, safety, and effectiveness in mission-critical government environments. Involves designing test datasets, benchmarks, and frameworks for various metrics, including LLM-judge evaluations, agent testing, and stress tests. | Eval Gate · Agent | 8 |
| **STEM Fellow - Human Frontier Collective (UK)** Focuses on evaluating and interpreting advanced generative AI systems by designing domain-specific problems and datasets, providing expert insights to enhance model performance, and contributing to research publications. A research-oriented fellowship with a focus on AI evaluation. | Eval Gate | 8 |
| **STEM Fellow - Human Frontier Collective (US)** Focuses on evaluating and interpreting advanced generative AI systems by designing domain-specific problems and datasets, providing expert insights to enhance model performance, and contributing to research publications. Involves collaboration with AI labs and Scale's research team. | Eval Gate · Post-train | 8 |
| **Product Manager, Public Sector GenAI Test & Evaluation (T&E)** Product Manager on Scale AI's Public Sector team, defining the vision and roadmap for evaluation capabilities and owning the T&E tech stack used to measure and improve agentic applications. Requires strong engineering depth, experience with evaluation systems, problem distillation, ambiguity management, cross-functional leadership, and operational execution. GenAI implementation experience, public sector work, and a security clearance are preferred. | Eval Gate | 7 |
| **Medical Fellow - Human Frontier Collective (UK)** A medical professional (MD or DO) collaborating with AI labs to evaluate generative AI systems, focusing on safety, accuracy, and reasoning frameworks within healthcare. The goal is to apply clinical expertise to shape AI decision-making and contribute to research publications. A 6-month independent contractor position with potential for extension. | Eval Gate · Post-train | 7 |
| **Medical Fellow - Human Frontier Collective (US)** A Medical Fellow collaborating on high-impact projects evaluating and interpreting advanced generative AI systems in healthcare. The fellow will design clinical scenarios, evaluate model safety and accuracy, shape reasoning frameworks, and contribute to research publications. Prior AI experience is not required but is a plus. | Eval Gate | 7 |
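The common thread across these "Eval Gate" roles is the automated evaluation pipeline: run a model over a benchmark, score each answer with a judge, and aggregate. The sketch below shows that loop in miniature; every name in it (`EvalCase`, `run_model`, `judge`, `evaluate`) is an illustrative stand-in, not an API from any listing above, and the exact-match judge is a placeholder for the LLM-judge scoring these roles describe.

```python
# Minimal sketch of an automated eval pipeline, as described in the
# "Model Evaluations" and "LLM Evals" roles. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    reference: str  # expected answer the judge compares against

def run_model(prompt: str) -> str:
    # Stand-in for a real model call (e.g. an API request);
    # this toy "model" just echoes the last word of the prompt.
    return prompt.split()[-1]

def judge(answer: str, reference: str) -> float:
    # Stand-in for an LLM-judge; here a trivial exact-match rubric.
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def evaluate(cases: list[EvalCase]) -> float:
    # Score every case and report the mean, the headline benchmark number.
    scores = [judge(run_model(c.prompt), c.reference) for c in cases]
    return sum(scores) / len(scores)

benchmark = [
    EvalCase("Echo the last word: hello", "hello"),
    EvalCase("Echo the last word: world", "world"),
]
print(evaluate(benchmark))  # 1.0 on this toy benchmark
```

In a real harness, `run_model` and `judge` would call hosted models, and per-case scores would feed dashboards rather than a single mean.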