Staff Machine Learning Engineer, AI Serving

Reddit · Consumer · San Francisco, CA · Machine Learning

Staff Machine Learning Engineer focused on leading the development of a large-scale, highly available, low-latency GPU-based model serving system for search, ranking, and LLMs, supporting millions of QPS. The role involves designing and developing ML and Generative AI systems in cloud-based production environments on Kubernetes, building high-performance feature hydration and processing systems, and building a unified GPU model export framework. Requires a strong understanding of real-time ML observability and experience serving LLMs online at scale.

What you'd actually do

  1. Lead the end-to-end design, implementation, and maintenance of a highly available, low-latency GPU-based model serving system for search, ranking, and LLMs, supporting millions of QPS.
  2. Design and develop ML and Generative AI systems in cloud-based production environments on Kubernetes at scale.
  3. Rapidly prototype and build a high-performance feature hydration and processing system as part of the inference stack, including routing, caching, and batching (a minimal batching sketch follows this list).
  4. Lead a unified GPU model export framework that converts trained models into optimized GPU inference models (see the export sketch below).
  5. Apply a strong understanding of real-time ML observability to track feature and model performance (see the metrics sketch below).
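
Item 3 is essentially dynamic batching in front of a feature-hydrated model call. Below is a hedged sketch of that idea in Python; `MicroBatcher`, `model_fn`, `max_batch`, and `max_wait_ms` are hypothetical names for illustration, not Reddit's actual stack, and real feature hydration and routing would sit between dequeue and the model call.

```python
# Hypothetical sketch of dynamic batching: coalesce concurrent requests
# into one batched model call, flushing on batch size or deadline.
import asyncio


class MicroBatcher:
    def __init__(self, model_fn, max_batch=32, max_wait_ms=5):
        self.model_fn = model_fn           # batched model call: list -> list
        self.max_batch = max_batch         # flush when this many are queued
        self.max_wait = max_wait_ms / 1000
        self.queue = asyncio.Queue()

    async def serve_forever(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]        # block for the first item
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # Feature hydration / caching would happen here, before inference.
            outputs = self.model_fn([payload for payload, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

    async def infer(self, payload):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((payload, fut))
        return await fut


async def main():
    batcher = MicroBatcher(model_fn=lambda xs: [x * 2 for x in xs])
    server = asyncio.create_task(batcher.serve_forever())
    print(await asyncio.gather(*(batcher.infer(i) for i in range(8))))
    server.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```

In production this usually comes from the serving layer itself (e.g., Triton's dynamic batcher or vLLM's continuous batching) rather than being hand-rolled.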
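
For item 4, a unified export step typically lowers an eager PyTorch model into serialized artifacts that optimized GPU runtimes can load. This is a minimal, hedged sketch; `TinyRanker` and the file names are placeholders, and a real framework would add validation, versioning, and per-backend optimization.

```python
# Hypothetical sketch of a model export step: eager PyTorch model ->
# TorchScript and ONNX artifacts that GPU inference runtimes can load.
import torch
import torch.nn as nn


class TinyRanker(nn.Module):
    """Placeholder ranking model standing in for a real trained model."""

    def __init__(self, n_features: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = TinyRanker().eval()
example = torch.randn(1, 16)  # example input used to trace the graph

# TorchScript via tracing: loadable by Triton's PyTorch backend.
torch.jit.trace(model, example).save("ranker.pt")

# ONNX: lets ONNX Runtime or TensorRT build an optimized GPU engine,
# with a dynamic batch dimension so the server can batch freely.
torch.onnx.export(
    model, example, "ranker.onnx",
    input_names=["features"], output_names=["score"],
    dynamic_axes={"features": {0: "batch"}, "score": {0: "batch"}},
)
```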
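
Item 5's real-time ML observability usually means emitting per-request metrics the platform can alert on. Below is a hedged sketch using the open-source prometheus_client library; the metric names and the toy `predict` function are made up for illustration.

```python
# Hypothetical sketch of real-time serving metrics with prometheus_client:
# a latency histogram plus a feature-quality counter.
import time

from prometheus_client import Counter, Histogram, start_http_server

INFER_LATENCY = Histogram("model_infer_seconds", "Inference latency", ["model"])
NULL_FEATURES = Counter("null_feature_total", "Null feature values seen", ["feature"])


def predict(features: dict) -> float:
    with INFER_LATENCY.labels(model="ranker_v1").time():
        for name, value in features.items():
            if value is None:                 # track feature outages / drift
                NULL_FEATURES.labels(feature=name).inc()
        time.sleep(0.002)                     # stand-in for a model call
        return 0.5


if __name__ == "__main__":
    start_http_server(9000)  # exposes /metrics for Prometheus to scrape
    while True:
        predict({"clicks_7d": 3, "age_days": None})
```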

Skills

Required

  • 7+ years of experience in ML Engineering, AI Platform Engineering, or Cloud AI Deployment roles
  • Experience operating orchestration systems such as Kubernetes at scale
  • Deep experience with cloud-based technologies for supporting an ML platform, such as AWS, Google Cloud Storage, and infrastructure-as-code tooling (e.g., Terraform)
  • Proficiency with common ML programming languages and frameworks, such as Go and Python
  • Strong focus on scalability, reliability, performance, and ease of use
  • Strong proficiency in Python
  • Deep experience with modern AI/ML frameworks (Triton, Dynamo, vLLM, PyTorch); a minimal vLLM example follows this list
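
The vLLM bullet above maps to code like the following offline example, using vLLM's documented `LLM`/`SamplingParams` API; the model name is just a small placeholder. Online serving at scale would instead launch vLLM's OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server`) behind the routing and batching layer.

```python
# Minimal offline vLLM example; the model is a small placeholder, and a
# production deployment would run vLLM's OpenAI-compatible server instead.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                    # loads weights onto the GPU
params = SamplingParams(temperature=0.8, max_tokens=64)

for output in llm.generate(["Summarize this post in one line:"], params):
    print(output.outputs[0].text)
```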

Nice to have

  • Strong knowledge of model serving, inference pipelines, monitoring, and observability for AI systems

What the JD emphasized

  • highly available
  • low-latency
  • GPU-based model serving system
  • Millions of QPS
  • Kubernetes at scale
  • LLM serving online at scale
  • E2E inference performance benchmarking framework

Other signals

  • ML Inference Platform