Staff Machine Learning Engineer, AI Serving

Reddit · Consumer · San Francisco, CA · Machine Learning

Staff Machine Learning Engineer focused on leading the development of a large-scale, highly available, low-latency GPU-based model serving system for search, ranking, and LLMs, supporting millions of QPS. The role involves designing and developing ML and Generative AI systems in cloud-based production environments on Kubernetes, building high-performance feature hydration and processing systems, and building a unified GPU model export framework. Requires a strong understanding of real-time ML observability and experience serving LLMs online at scale.

What you'd actually do

  1. Lead the end-to-end design, implementation, and maintenance of a highly available, low-latency GPU-based model serving system for search, ranking, and LLMs, supporting millions of QPS.
  2. Design and develop ML and Generative AI systems in cloud-based production environments on Kubernetes at scale.
  3. Rapidly prototype and build a high-performance feature hydration and processing system as part of the inference stack, including routing, caching, and batching (a minimal batching sketch follows this list).
  4. Lead a unified GPU model export framework that converts trained models into optimized GPU inference models (see the export sketch below).
  5. Apply a strong understanding of real-time ML observability to track feature and model performance (see the metrics sketch below).
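
Item 3 is essentially dynamic batching in front of a feature-hydrated model call. Below is a hedged sketch of that idea in Python; `MicroBatcher`, `model_fn`, `max_batch`, and `max_wait_ms` are hypothetical names for illustration, not Reddit's actual stack, and real feature hydration and routing would sit between dequeue and the model call.

```python
# Hypothetical sketch of dynamic batching: coalesce concurrent requests
# into one batched model call, flushing on batch size or deadline.
import asyncio


class MicroBatcher:
    def __init__(self, model_fn, max_batch=32, max_wait_ms=5):
        self.model_fn = model_fn           # batched model call: list -> list
        self.max_batch = max_batch         # flush when this many are queued
        self.max_wait = max_wait_ms / 1000
        self.queue = asyncio.Queue()

    async def serve_forever(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]        # block for the first item
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # Feature hydration / caching would happen here, before inference.
            outputs = self.model_fn([payload for payload, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

    async def infer(self, payload):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((payload, fut))
        return await fut


async def main():
    batcher = MicroBatcher(model_fn=lambda xs: [x * 2 for x in xs])
    server = asyncio.create_task(batcher.serve_forever())
    print(await asyncio.gather(*(batcher.infer(i) for i in range(8))))
    server.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```

In production this usually comes from the serving layer itself (e.g., Triton's dynamic batcher or vLLM's continuous batching) rather than being hand-rolled.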
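
For item 4, a unified export step typically lowers an eager PyTorch model into serialized artifacts that optimized GPU runtimes can load. This is a minimal, hedged sketch; `TinyRanker` and the file names are placeholders, and a real framework would add validation, versioning, and per-backend optimization.

```python
# Hypothetical sketch of a model export step: eager PyTorch model ->
# TorchScript and ONNX artifacts that GPU inference runtimes can load.
import torch
import torch.nn as nn


class TinyRanker(nn.Module):
    """Placeholder ranking model standing in for a real trained model."""

    def __init__(self, n_features: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = TinyRanker().eval()
example = torch.randn(1, 16)  # example input used to trace the graph

# TorchScript via tracing: loadable by Triton's PyTorch backend.
torch.jit.trace(model, example).save("ranker.pt")

# ONNX: lets ONNX Runtime or TensorRT build an optimized GPU engine,
# with a dynamic batch dimension so the server can batch freely.
torch.onnx.export(
    model, example, "ranker.onnx",
    input_names=["features"], output_names=["score"],
    dynamic_axes={"features": {0: "batch"}, "score": {0: "batch"}},
)
```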
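
Item 5's real-time ML observability usually means emitting per-request metrics the platform can alert on. Below is a hedged sketch using the open-source prometheus_client library; the metric names and the toy `predict` function are made up for illustration.

```python
# Hypothetical sketch of real-time serving metrics with prometheus_client:
# a latency histogram plus a feature-quality counter.
import time

from prometheus_client import Counter, Histogram, start_http_server

INFER_LATENCY = Histogram("model_infer_seconds", "Inference latency", ["model"])
NULL_FEATURES = Counter("null_feature_total", "Null feature values seen", ["feature"])


def predict(features: dict) -> float:
    with INFER_LATENCY.labels(model="ranker_v1").time():
        for name, value in features.items():
            if value is None:                 # track feature outages / drift
                NULL_FEATURES.labels(feature=name).inc()
        time.sleep(0.002)                     # stand-in for a model call
        return 0.5


if __name__ == "__main__":
    start_http_server(9000)  # exposes /metrics for Prometheus to scrape
    while True:
        predict({"clicks_7d": 3, "age_days": None})
```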

Skills

Required

  • 7+ years of experience in ML Engineering, AI Platform Engineering, or Cloud AI Deployment roles
  • Experience operating orchestration systems such as Kubernetes at scale
  • Deep experience with cloud-based technologies for supporting an ML platform, such as AWS, Google Cloud Storage, and infrastructure-as-code tooling (e.g., Terraform)
  • Proficiency with common ML programming languages and frameworks, such as Go and Python
  • Strong focus on scalability, reliability, performance, and ease of use
  • Strong proficiency in Python
  • Deep experience with modern AI/ML frameworks (Triton, Dynamo, vLLM, PyTorch); a minimal vLLM example follows this list
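
The vLLM bullet above maps to code like the following offline example, using vLLM's documented `LLM`/`SamplingParams` API; the model name is just a small placeholder. Online serving at scale would instead launch vLLM's OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server`) behind the routing and batching layer.

```python
# Minimal offline vLLM example; the model is a small placeholder, and a
# production deployment would run vLLM's OpenAI-compatible server instead.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                    # loads weights onto the GPU
params = SamplingParams(temperature=0.8, max_tokens=64)

for output in llm.generate(["Summarize this post in one line:"], params):
    print(output.outputs[0].text)
```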

Nice to have

  • Strong knowledge of model serving, inference pipelines, monitoring, and observability for AI systems

What the JD emphasized

  • highly available
  • low-latency
  • GPU-based model serving system
  • Millions of QPS
  • Kubernetes at scale
  • LLM serving online at scale
  • E2E inference performance benchmarking framework

Other signals

  • ML Inference Platform