Software Engineer, Inference Scalability and Capability

Anthropic · AI Frontier · AI Research & Engineering

A Software Engineer role focused on building and scaling inference systems for LLMs, optimizing for performance, reliability, and compute efficiency. The role involves tackling complex distributed-systems challenges across the inference stack, from request routing to caching, as well as supporting new model architectures and inference features.

What you'd actually do

  1. Optimizing inference request routing to maximize compute efficiency (see the routing sketch after this list)
  2. Autoscaling our compute fleet to match compute supply with inference demand
  3. Contributing to new inference features (e.g., structured sampling, fine-tuning)
  4. Supporting inference for new model architectures
  5. Ensuring smooth and regular deployment of inference services
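
To make the routing item above concrete, here is a minimal sketch, assuming a fleet of interchangeable model replicas: requests whose prompt prefix matches a replica's warm cache are routed there, and everything else goes to the least-loaded replica. All names (Replica, route_request, prefix_key) are hypothetical illustrations, not part of any real serving stack.

```python
# Hypothetical sketch: cache-affinity-aware request routing.
# None of these names describe a real production system.
import hashlib
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    in_flight: int = 0                                      # requests currently being served
    warm_prefixes: set[str] = field(default_factory=set)    # hashed prompt prefixes in cache


def prefix_key(prompt: str, prefix_tokens: int = 256) -> str:
    """Hash a coarse prompt prefix so replicas with a warm cache can be matched."""
    return hashlib.sha256(prompt[: prefix_tokens * 4].encode()).hexdigest()[:16]


def route_request(prompt: str, replicas: list[Replica]) -> Replica:
    """Prefer a replica that already holds the prompt prefix; otherwise pick the least loaded."""
    key = prefix_key(prompt)
    warm = [r for r in replicas if key in r.warm_prefixes]
    target = min(warm or replicas, key=lambda r: r.in_flight)
    target.in_flight += 1
    target.warm_prefixes.add(key)
    return target


if __name__ == "__main__":
    fleet = [Replica("gpu-a"), Replica("gpu-b")]
    first = route_request("You are a helpful assistant. Summarize:", fleet)
    second = route_request("You are a helpful assistant. Summarize:", fleet)
    assert first is second  # the second request reuses the warm prefix cache
```

A production router would also weigh queue depth, token budgets, and session affinity; the point of the sketch is only that cache-aware placement and load balancing pull in different directions, and the router has to arbitrate between them.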

Skills

Required

  • significant software engineering experience
  • high-performance, large-scale distributed systems
  • Python

Nice to have

  • results-oriented, with a bias towards flexibility and impact
  • pick up slack, even if it goes outside your job description
  • enjoy pair programming
  • want to learn more about machine learning research
  • care about the societal impacts of your work
  • implementing and deploying machine learning systems at scale
  • LLM optimization: batching and caching strategies (see the batching sketch after this list)
  • Kubernetes
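
The batching-and-caching item above is the kind of optimization where a small example helps. Below is a minimal sketch, assuming a single shared request queue, of dynamic batching: pending requests are drained into one batch until either a size cap or a latency budget is hit. The names (GenRequest, collect_batch) and the specific limits are hypothetical.

```python
# Hypothetical sketch: dynamic batching of generation requests.
# Groups pending requests into one forward pass up to a size or latency budget.
import queue
import time
from dataclasses import dataclass


@dataclass
class GenRequest:
    prompt: str
    max_tokens: int


def collect_batch(pending: "queue.Queue[GenRequest]",
                  max_batch_size: int = 8,
                  max_wait_ms: float = 10.0) -> list[GenRequest]:
    """Drain the queue until the batch is full or the latency budget expires."""
    batch: list[GenRequest] = []
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch_size:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break
        try:
            batch.append(pending.get(timeout=timeout))
        except queue.Empty:
            break
    return batch


if __name__ == "__main__":
    q: "queue.Queue[GenRequest]" = queue.Queue()
    for i in range(3):
        q.put(GenRequest(prompt=f"request {i}", max_tokens=64))
    print([r.prompt for r in collect_batch(q)])
```

The parameters encode a throughput-versus-latency trade-off: a larger max_batch_size amortizes more of each forward pass, while a smaller max_wait_ms bounds how long an early request waits for stragglers.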

What the JD emphasized

  • significant software engineering experience
  • high-performance, large-scale distributed systems
  • implementing and deploying machine learning systems at scale
  • LLM optimization: batching and caching strategies

Other signals

  • building and maintaining critical systems that serve LLMs
  • scaling inference systems
  • ensuring reliability
  • optimizing compute resource efficiency
  • developing new inference capabilities
  • distributed systems challenges across inference stack
  • optimal request routing
  • efficient prompt caching
  • implementing and deploying ML systems at scale
  • LLM optimization: batching and caching strategies
  • autoscaling the compute fleet (see the autoscaling sketch after this list)
  • supporting inference for new model architectures
  • smooth and regular deployment of inference services
  • analyzing observability data to tune performance
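
Several of the signals above, autoscaling the compute fleet and analyzing observability data to tune performance, boil down to sizing the fleet from measured demand. Here is a minimal sketch of that calculation, assuming demand is summarized as request rate and average tokens per request; the throughput figures and bounds are illustrative, not real numbers.

```python
# Hypothetical sketch: demand-based autoscaling for an inference fleet.
# Sizes the fleet from observed request rate and per-replica throughput,
# with headroom and floor/ceiling bounds. All numbers are illustrative only.
import math


def target_replicas(requests_per_sec: float,
                    tokens_per_request: float,
                    replica_tokens_per_sec: float,
                    headroom: float = 0.2,
                    min_replicas: int = 2,
                    max_replicas: int = 64) -> int:
    """Return the replica count needed to serve current demand with spare capacity."""
    demand_tokens_per_sec = requests_per_sec * tokens_per_request
    needed = demand_tokens_per_sec * (1.0 + headroom) / replica_tokens_per_sec
    return max(min_replicas, min(max_replicas, math.ceil(needed)))


if __name__ == "__main__":
    # 120 req/s averaging 400 generated tokens, each replica sustaining 5,000 tokens/s
    print(target_replicas(120, 400, 5_000))  # -> 12
```

A real autoscaler would smooth these inputs over a window and rate-limit scale-downs to avoid thrashing, but the core sizing logic is this small.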