ML Infrastructure Engineer, Safeguards

Anthropic · AI Frontier · San Francisco, CA · AI Research & Engineering

ML Infrastructure Engineer focused on building and scaling critical infrastructure for AI safety systems: real-time and batch classifier and safety evaluations, monitoring, and inference optimization for safety-critical applications.

What you'd actually do

  1. Design and build scalable ML infrastructure to support real-time and batch classifier and safety evaluations across our model ecosystem
  2. Build monitoring and observability tools to track model performance, data quality, and system health for safety-critical applications
  3. Collaborate with research teams to productionize safety research, translating experimental safety techniques into robust, scalable systems
  4. Optimize inference latency and throughput for real-time safety evaluations while maintaining high reliability standards
  5. Implement automated testing, deployment, and rollback systems for ML models in production safety applications

Skills

Required

  • Python
  • PyTorch, TensorFlow, or JAX
  • cloud platforms (AWS, GCP)
  • Kubernetes
  • distributed systems principles
  • data engineering tools (Spark, Airflow, or streaming systems)

Nice to have

  • large language models
  • transformer architectures
  • A/B testing frameworks
  • experimentation infrastructure
  • monitoring and alerting systems
  • automated labeling systems
  • human-in-the-loop workflows
  • trust & safety, fraud prevention, or content moderation domains
  • privacy-preserving ML techniques
  • compliance requirements
  • open-source ML infrastructure projects

What the JD emphasized

  • production ML infrastructure
  • safety-critical domains
  • high-throughput, low-latency workloads
  • reliability and impact in safety-critical systems

Other signals

  • ML infrastructure
  • AI safety systems
  • production ML infrastructure
  • safety-critical applications
  • inference latency and throughput