Software Engineer, ML Performance and Scaling

Anthropic · AI Frontier · AI Research & Engineering

Software Engineer focused on optimizing the performance, throughput, and robustness of large-scale distributed ML systems, including implementing low-latency sampling, low-precision inference, and efficient serving algorithms.

What you'd actually do

  1. Identify performance and robustness problems, then develop systems that optimize the throughput and robustness of our largest distributed systems
  2. Implement low-latency, high-throughput sampling for large language models
  3. Implement GPU kernels to adapt our models to low-precision inference
  4. Write a custom load-balancing algorithm to optimize serving efficiency
  5. Design and implement a fault-tolerant distributed system running with a complex network topology
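To make the serving-efficiency item concrete, here is a minimal sketch of one common load-balancing strategy: least-outstanding-requests routing, where each new request goes to the replica with the fewest in-flight requests. The class and its design are illustrative assumptions only, not Anthropic's actual serving implementation.

```python
class LeastLoadedBalancer:
    """Toy least-outstanding-requests balancer (illustrative sketch only;
    the name and design are assumptions, not a real serving system)."""

    def __init__(self, replicas):
        # Track in-flight request counts per replica.
        self._load = {r: 0 for r in replicas}

    def acquire(self):
        # Route to the replica with the fewest in-flight requests.
        # O(n) scan is fine for a sketch; a heap would scale better.
        replica = min(self._load, key=self._load.get)
        self._load[replica] += 1
        return replica

    def release(self, replica):
        # Mark one request on this replica as finished.
        self._load[replica] -= 1
```

A production balancer would also weight replicas by observed latency and handle replica failures, but the core invariant is the same: route each request to the least-loaded backend.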

Skills

Required

  • Significant software engineering or machine learning experience
  • Experience working at supercomputing scale
  • Results-oriented, with a bias toward flexibility and impact
  • Willingness to pick up slack outside your job description
  • Interest in learning more about machine learning research
  • Care about the societal impacts of your work
  • Bachelor's degree in a related field or equivalent experience

Nice to have

  • High-performance, large-scale ML systems
  • GPU/Accelerator programming
  • ML framework internals
  • OS internals
  • Language modeling with transformers
  • Pair programming

What the JD emphasized

  • Solving large-scale systems problems
  • Supercomputing scale
  • High-performance, large-scale ML systems
  • GPU/Accelerator programming
  • ML framework internals
  • OS internals
  • Low latency
  • High throughput
  • Low-precision inference
  • Fault-tolerant distributed systems

Other signals

  • optimize throughput and robustness of distributed systems
  • low-latency high-throughput sampling
  • low-precision inference
  • optimize serving efficiency
  • fault-tolerant distributed system