Staff Machine Learning Engineer, ML Efficiency

Reddit · Consumer · London, United Kingdom · Ads Engineering

Staff Machine Learning Engineer focused on ML efficiency at Reddit, building infrastructure and tooling to optimize ML training and inference workloads, improve resource utilization, and reduce costs. The role involves designing and building systems for efficient ML operations, developing debugging and profiling tools, optimizing GPU utilization, and driving the technical strategy for ML platform scalability.

What you'd actually do

  1. Design and build systems that improve the efficiency of ML training and inference workloads.
  2. Develop tooling that helps ML engineers debug, profile, optimize, and monitor model performance.
  3. Improve GPU and general resource utilization through scheduling, resource management, caching, and workload optimization.
  4. Partner with ML researchers and product teams to identify bottlenecks and drive performance improvements.
  5. Build benchmarking frameworks and performance dashboards for training and serving systems.

Skills

Required

  • Python
  • a systems language (Go, C++, Rust, or Java)
  • building distributed systems at scale
  • machine learning infrastructure, training systems, or model serving platforms
  • performance engineering and systems optimization
  • debugging and profiling skills

Nice to have

  • large-scale recommendation, ranking, generative AI, or foundation model systems
  • distributed training frameworks such as PyTorch Distributed, Ray, TensorFlow, and Spark
  • GPU architectures and performance analysis tools
  • optimizing cloud infrastructure costs across large ML workloads
  • contributions to internal platforms used by multiple ML teams
  • building real-time ML inference applications

What the JD emphasized

  • machine learning infrastructure, training systems, or model serving platforms
  • performance engineering and systems optimization
  • distributed training frameworks such as PyTorch Distributed, Ray, TensorFlow, and Spark
  • optimizing cloud infrastructure costs across large ML workloads
  • building real-time ML inference applications

Other signals

  • The ML Efficiency team builds the infrastructure, tooling, and optimization systems that enable machine learning engineers and researchers to train, evaluate, deploy, and operate models efficiently at scale.
  • The team focuses on improving developer productivity, reducing infrastructure costs, increasing hardware utilization, and accelerating experimentation across the company’s ML ecosystem.
  • Design and build systems that improve the efficiency of ML training and inference workloads.
  • Optimize distributed training infrastructure, data pipelines, and model serving architectures.