Senior ML Platform Engineer

NVIDIA · Semiconductors · Santa Clara, CA +5 · Remote

The Senior ML Platform Engineer at NVIDIA architects, builds, and scales high-performance ML infrastructure using Infrastructure-as-Code (IaC) practices. The role focuses on creating reliable, automated platforms for training and deploying advanced ML models on GPU systems, applying SRE principles, and developing internal automation for ML workflows. It requires strong software engineering skills in Python or Go, experience with Kubernetes and Docker, and a solid understanding of ML workflows.

What you'd actually do

  1. Design, build, and maintain our core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters.
  2. Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads.
  3. Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, with a strong focus on software engineering best practices.
  4. Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline their end-to-end experimentation.
  5. Evolve and operate our multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols.

Skills

Required

  • BS/MS in Computer Science, Engineering, or equivalent experience
  • 5+ years in software/platform engineering or SRE roles
  • 3+ years focused on ML infrastructure or distributed compute systems
  • Strong proficiency in Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform
  • SRE principles
  • Diagnosing system-level issues
  • Performance tuning
  • Platform reliability
  • Solid understanding of ML workflows and lifecycle
  • Kubernetes
  • Docker
  • Python
  • Go
  • Linux systems internals
  • Networking
  • Performance tuning at scale

Nice to have

  • Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale
  • Deep understanding of distributed training techniques (e.g., data/model parallelism, Horovod, NCCL)
  • Expertise with modern CI/CD methodologies and GitOps practices
  • Proven ability to contribute code to complex orchestration or automation platforms

What the JD emphasized

  • 3+ years focused on ML infrastructure or distributed compute systems.
  • Strong proficiency in Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with a proven track record of building and managing production infrastructure.
  • SRE-oriented mindset with extensive experience in diagnosing system-level issues, performance tuning, and ensuring platform reliability.
  • Solid understanding of ML workflows and lifecycle, from data preprocessing to deployment.
  • Proficiency in operating containerized workloads with Kubernetes and Docker.
  • Strong software engineering skills in languages such as Python or Go, with a focus on automation, tooling, and writing production-grade code.
  • Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale.
  • Deep understanding of distributed training techniques (e.g., data/model parallelism, Horovod, NCCL).

Other signals

  • ML infrastructure
  • GPU systems
  • train and deploy
  • SRE
  • software engineering
  • automation
  • orchestration
  • Kubernetes
  • Docker
  • Python
  • Go
  • Linux
  • PyTorch
  • TensorFlow
  • distributed training