Staff Machine Learning Engineer, ML Efficiency

Reddit · Consumer · London, United Kingdom · Ads Engineering

This is a Staff Machine Learning Engineer role focused on ML Efficiency at Reddit: building infrastructure and tooling to optimize ML training and inference workloads, improve resource utilization, and reduce costs. The work spans designing and building systems, developing debugging and profiling tools, partnering with ML teams, building benchmarking frameworks, optimizing distributed training and serving architectures, and driving technical strategy for ML platform scalability and cost efficiency.

What you'd actually do

  1. Design and build systems that improve the efficiency of ML training and inference workloads.
  2. Develop tooling that helps ML engineers debug, profile, optimize, and monitor model performance.
  3. Improve GPU and general resource utilization through scheduling, resource management, caching, and workload optimization.
  4. Partner with ML researchers and product teams to identify bottlenecks and drive performance improvements.
  5. Build benchmarking frameworks and performance dashboards for training and serving systems.

Skills

Required

  • Python
  • distributed systems
  • machine learning infrastructure
  • training systems
  • model serving platforms
  • performance engineering
  • systems optimization
  • debugging
  • profiling

Nice to have

  • Go
  • C++
  • Rust
  • Java
  • recommendation systems
  • ranking systems
  • generative AI
  • foundation models
  • PyTorch Distributed
  • Ray
  • TensorFlow
  • Spark
  • GPU architectures
  • cloud infrastructure cost optimization
  • real-time ML inference applications

What the JD emphasized

  • machine learning infrastructure
  • training systems
  • model serving platforms
  • performance engineering
  • systems optimization
  • distributed systems