Software Engineer, Systems ML

Meta · Big Tech · Bellevue, WA +2

A Software Engineer specializing in systems machine learning who designs and builds infrastructure for large-scale AI systems, with a focus on training efficiency, model serving, distributed computation, and hardware-software co-design. The role involves translating ML research into production systems and optimizing performance at scale.

What you'd actually do

  1. Design and implement scalable systems for distributed ML training and inference, including optimizations across compute, memory, and communication bottlenecks
  2. Develop and evaluate novel techniques for accelerating AI research workflows, such as training, inference, RL, and evals, on the latest-generation hardware platforms
  3. Lead the architecture and end-to-end delivery of major systems ML initiatives, coordinating across research scientists, product engineers, and external partners
  4. Establish performance benchmarking frameworks and profiling pipelines to identify bottlenecks and drive measurable improvements in training throughput and inference latency
  5. Define service level objectives and reliability standards for ML training and serving systems, building dashboards and runbooks to reduce incident response time

Skills

Required

  • Systems engineering
  • Machine learning infrastructure
  • Distributed ML training systems
  • Distributed ML inference systems
  • PyTorch
  • JAX
  • TensorFlow
  • C++
  • CUDA
  • Performance profiling
  • Kernel optimization
  • Compiler-level ML optimizations
  • Technical design and delivery of complex systems
  • Data-driven methods and experimentation
  • ML compiler stacks (MLIR, XLA, TVM, or Triton)
  • Hardware-software co-design for AI accelerators
  • Automated tooling or frameworks for ML infrastructure
  • Model parallelism strategies (tensor parallelism, pipeline parallelism, expert parallelism)

Nice to have

  • Master's or PhD degree in Computer Science, Electrical Engineering, Machine Learning, or a related technical field
  • Prompt/context engineering
  • Agent orchestration
  • Responsible, ethical AI practices (risk assessment, bias mitigation, quality and accuracy reviews)

What the JD emphasized

  • 8+ years of experience in systems engineering, machine learning infrastructure, or a closely related field
  • Experience designing and optimizing distributed ML training or inference systems at scale
  • Experience with low-level systems programming in C++ or CUDA, including performance profiling, kernel optimization, or compiler-level ML optimizations
  • Experience leading the technical design and delivery of complex, cross-functional systems ML projects from inception through production deployment
  • Track record of publishing research on systems ML topics at venues such as MLSys, OSDI, SOSP, NeurIPS, or ICML

Other signals

  • large-scale AI systems
  • ML research and systems engineering
  • training efficiency
  • model serving
  • distributed computation
  • hardware-software co-design
  • production-grade systems
  • massive scale
  • AI-driven products