Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

NVIDIA · Semiconductors · NC +1 · Remote

Senior Software Engineer role focused on scaling NVIDIA's AI Infrastructure, specifically DGX Cloud. The role involves working with Kubernetes, custom software for scheduling GPU resources, and implementing monitoring for reliability and scalability of GPU assets for AI workloads. Requires strong software engineering experience with Kubernetes and distributed systems.

What you'd actually do

  1. You will be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads. This includes working on custom software related to scheduling GPU resources on Kubernetes.
  2. Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets. You will be harnessing multiple data streams, ranging from GPU hardware diagnostics to cluster and network telemetry.
  3. Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance. Evaluating system failures and improving services based on a well-defined incident management process.

Skills

Required

  • Kubernetes
  • GPU resource scheduling
  • distributed systems
  • software engineering
  • Go
  • Python
  • data structures
  • algorithms
  • large-scale production systems

Nice to have

  • operator development
  • node health monitoring
  • Slurm
  • Bright Cluster Manager
  • managing and automating large-scale distributed systems independent of cloud providers

What the JD emphasized

  • custom software related to scheduling GPU resources on Kubernetes
  • GPU resource scheduling
  • large scalable GPU clusters
  • AI workloads
  • production systems
  • Kubernetes
  • distributed systems
  • AI infrastructure

Other signals

  • GPU resource scheduling
  • AI workloads
  • production systems
  • distributed systems
  • Kubernetes