Machine Learning Systems Engineer - Data & Evaluation, Horizons

Anthropic · AI Frontier · AI Research & Engineering

The Machine Learning Systems Engineer on the Horizons team builds the software infrastructure that lets AI models use tools effectively and that measures their performance. The work involves extending the agent framework, creating evaluations, managing training data pipelines, and applying data science techniques to improve model capabilities. The role combines software development with empirical analysis and involves working closely with research and production teams.

What you'd actually do

  1. Extend and improve our agent framework that enables models to interact with tools and environments
  2. Design and implement evaluation systems that rigorously measure model capabilities across tasks
  3. Build and maintain data pipelines for collecting, processing, and managing RL training data
  4. Develop dashboards and analysis tools to extract insights from model performance data
  5. Collaborate with researchers to translate evaluation needs into scalable, production-grade systems

Skills

Required

  • Python
  • data analysis libraries (Pandas, NumPy, etc.)
  • data visualization
  • interactive dashboards
  • software engineering fundamentals
  • clean APIs
  • communicating technical concepts
  • web development for interactive tools (JavaScript, React, etc.)

Nice to have

  • LLM-specific evaluations and frameworks
  • data visualization libraries and frameworks (D3.js, Plotly, Grafana)
  • Jupyter ecosystem
  • notebook-based workflows
  • statistical methods
  • experimental design
  • web frameworks for building interactive applications (FastAPI, Flask)
  • large datasets
  • performance considerations
  • ML research papers
  • implementing metrics from academic literature

What the JD emphasized

  • agent framework
  • evaluations
  • data pipelines
  • model performance
  • reinforcement learning
