Principal Cloud Services Software Engineer

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

The NVIDIA DGX Cloud team is seeking a Principal Cloud Services Software Engineer to develop and optimize AI infrastructure services for large-scale AI training workflows. The role involves designing and implementing resilient, efficient services orchestrated by Kubernetes, with a focus on backend development, distributed systems, and high-performance computing.

What you'd actually do

  1. Developing solutions at the intersection of machine learning, distributed systems, and high-performance computing, contributing to the advancement of AI technologies.
  2. Designing, developing, and optimizing (micro-)services orchestrated by Kubernetes that run large-scale AI training workflows resiliently and efficiently on AI training supercomputers hosted at major CSPs.
  3. Co-designing and implementing the APIs that allow these services to integrate vertically with NVIDIA's resiliency stacks, ranging from tier-0 telemetry services to break/fix automation services to checkpoint and execution systems.
  4. Crafting a submission abstraction that lets model engineers and training platforms/frameworks seamlessly submit long-running training jobs while hiding the complexity of handling infrastructure failures, managing job lifecycles with auto-restarts on failure, ensuring full efficiency, and promptly notifying users.
  5. Crafting these services to be modular, so they can be integrated with and deployed onto on-premises AI clusters that use NVIDIA hardware and cloud services.

Skills

Required

  • Backend development
  • Python
  • Go
  • C/C++
  • High-performance languages
  • Large-scale distributed systems
  • Cloud computing platforms (AWS, Azure, GCP)
  • Container technologies (Docker, Kubernetes)
  • HPC/AI platforms (Slurm)

Nice to have

  • DL frameworks and orchestrators (PyTorch, TensorFlow, JAX, Ray)
  • Framework plugin architecture
  • NVIDIA GPUs
  • Network technologies
  • AI models
  • AI based tools

What the JD emphasized

  • 15+ years of hands-on experience in backend development
  • Consistent track record of building and scaling large-scale distributed systems

Other signals

  • ML workloads
  • AI infrastructure tools
  • AI platform
  • Large-scale distributed systems
  • AI training workflows
  • Long-running training jobs