Currently tracking 82 active AI roles, up 61% versus the prior 4 weeks. Primary focus: Agent · Engineering. Salary range $139k–$393k (avg $256k).
| Title | Stage | AI score |
|---|---|---|
| **Research Scientist, Frontier Risk Evaluations.** Designs and builds evaluation measures, harnesses, and datasets for frontier AI systems, focused on identifying and mitigating risks. Involves collaborating with external agencies and publishing findings, bridging AI research and policy. | Eval Gate · Agent | 9 |
| **Research Scientist, AI Controls and Monitoring.** Designs methods, systems, and experiments for AI controls and monitoring, ensuring advanced AI models and agents remain aligned with intended goals even in high-stakes or adversarial environments. Work includes developing monitoring techniques, researching layered control mechanisms, designing red-team simulations, and collaborating with policymakers. |  | 9 |
| **Staff Machine Learning Research Scientist, LLM Evals.** Scale AI seeks a Staff Machine Learning Research Scientist to lead the development of novel evaluation methodologies, metrics, and benchmarks for large language models (LLMs). The role defines and measures the capabilities and limitations of frontier LLMs, driving research that informs internal roadmaps and the broader community. Responsibilities include researching existing evaluation techniques, designing new benchmarks, implementing scalable evaluation pipelines, publishing findings, and mentoring junior researchers. Ideal candidates have 5+ years of experience in LLMs/NLP, a strong publication record, and experience leading research teams. | Eval Gate · Post-train | 9 |
| **Tech Lead/Manager, Machine Learning Research Scientist - LLM Evals.** Scale AI seeks a Tech Lead/Manager for its LLM Evals Research team. The role leads a team developing and implementing novel evaluation methodologies, metrics, and benchmarks for large language models, focusing on areas such as instruction following, factuality, robustness, and fairness. It requires research into LLM evaluation techniques, communication with clients and internal teams, implementation of scalable evaluation pipelines, and publishing research findings. Ideal candidates have extensive experience in LLMs, NLP, and Transformer modeling, with a proven track record of research impact and team leadership. | Eval Gate · Post-train | 9 |
| **Senior Machine Learning Engineer - Model Evaluations, Public Sector.** Builds and scales automated evaluation pipelines for AI systems, including LLMs and agentic models, to ensure their reliability, safety, and effectiveness in mission-critical government environments. Involves designing test datasets, benchmarks, and frameworks for a range of metrics, including LLM-judge evaluations, agent testing, and stress tests. | Eval Gate · Agent | 8 |
| **Product Manager, Public Sector GenAI Test & Evaluation (T&E).** Product Manager on Scale AI's Public Sector team, defining the vision and roadmap for evaluation capabilities and owning the T&E tech stack used to measure and improve agentic applications. Requires strong engineering depth, experience with evaluation systems, problem distillation, ambiguity management, cross-functional leadership, and operational execution. GenAI implementation experience, public sector work, and a security clearance are preferred. | Eval Gate | 7 |