Senior Machine Learning Engineer, ML Training Platform

Reddit · Consumer · United States · Remote · Machine Learning

This Senior Machine Learning Engineer role focuses on building and maintaining Reddit's ML training platform infrastructure. It involves architecting, implementing, and optimizing systems for large-scale ML workflows, including Kubernetes automation, GPU orchestration, and the developer experience for internal ML engineers. The goal is to deliver a self-service ML platform that supports continuous iteration and improvement of ML systems.

What you'd actually do

  1. Lead the building, testing, and maintenance of ML training infrastructure at Reddit.
  2. Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows.
  3. Evolve the ML engineer (MLE) experience, from provisioning interactive GPU environments through large-scale training, with support for on-demand, self-service workflows.
  4. Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance.
  5. GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully.

Skills

Required

  • Python
  • Go
  • Kubernetes
  • GPU
  • Cloud Provider Experience
  • Distributed training frameworks
  • Ray
  • AWS
  • GCP
  • Distributed systems
  • Big data

Nice to have

  • Jupyter Ecosystem Knowledge

What the JD emphasized

  • Deep Kubernetes Expertise
  • GPU Experience

Other signals

  • ML infrastructure
  • ML platform
  • training infrastructure
  • large-scale machine learning workflows
  • Kubernetes
  • GPU orchestration