Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Senior AI Infrastructure Engineer responsible for designing, building, and maintaining large-scale production systems for NVIDIA's DGX Cloud, focusing on AI training and inferencing platforms. This role involves infrastructure automation, distributed systems, performance characterization, and ensuring reliability and availability of GPU cloud services.

What you'd actually do

  1. Design, build, deploy, and run internal tooling for a large-scale AI training and inferencing platform built on top of cloud infrastructure.
  2. Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  3. Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.
  4. Support services before they become available through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.
  5. Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

Skills

Required

  • Python
  • Go
  • C/C++
  • Java
  • Linux
  • Networking
  • Storage
  • Container Technologies
  • Public Cloud
  • Infrastructure as Code (IaC)
  • Terraform
  • Distributed systems experience

Nice to have

  • Kubernetes
  • Slurm

What the JD emphasized

  • 5+ years of experience
  • large-scale AI training and inferencing platform
  • large multi-GPU and multi-node clusters
  • large-scale private or public cloud platforms
  • Kubernetes or Slurm

Other signals

  • large-scale AI training and inferencing platform
  • multi-GPU and multi-node clusters
  • large-scale private or public cloud platforms