Senior Systems Software Engineer, Kubernetes Scale - DGX Cloud

NVIDIA · Semiconductors · Santa Clara, CA +1

This Senior Systems Software Engineer role focuses on scaling the AI infrastructure behind NVIDIA DGX Cloud, specifically optimizing Kubernetes and distributed inference serving for performance, cost, and reliability. The work spans end-to-end performance characterization, developing automated tests for AI workloads, debugging complex distributed systems, and contributing to open-source communities.

What you'd actually do

  1. Drive end-to-end performance and scale characterization for the NVIDIA DGX Cloud software stack, from Kubernetes control and data planes through NVIDIA components such as GPU Operator, Network Operator, DCGM, NIM, and distributed inference serving, following issues from orchestration down to the metal.
  2. Collaborate with AI researchers, developers and customers to develop innovative, automated tests that simulate real user workloads using custom-built and leading open-source tools and frameworks.
  3. Deep dive into performance and scale issues in complex distributed systems, including interactions between Kubernetes and the NVIDIA software stack, to identify and resolve root causes.
  4. Design and develop monitoring, reporting and analysis tools for performance and scale testing across software, GPU and CPU resources.
  5. Triage, debug, and root-cause issues that arise when operating Kubernetes clusters at ultra-large scale, ensuring reliability and efficiency.

Skills

Required

  • Kubernetes
  • distributed systems
  • systems performance and scalability
  • Golang
  • Python
  • NVIDIA software ecosystem (training and inference)
  • public CSP infrastructure (GCP, AWS, Azure, OCI)
  • performance modeling
  • benchmarking
  • computer architecture
  • networking
  • storage systems
  • accelerators

Nice to have

  • Kubernetes distributions
  • open-source community
  • PhD in relevant areas

What the JD emphasized

  • deep experience in distributed systems
  • strong background in systems performance and scalability
  • broad, end-to-end experience across the stack
  • technical depth to investigate and address exciting, real-world problems at scale
  • scaling AI infrastructure
  • optimizing total cost of ownership
  • driving down cost per token
  • distributed inference serving
  • performance and scale characterization
  • complex distributed systems
  • operating Kubernetes clusters at ultra-large scale
  • performance and scale testing
  • performance modeling and benchmarking at scale
  • large scale parallel and distributed accelerator-based systems
  • scaling Kubernetes clusters to ultra-large node and object counts

Other signals

  • Kubernetes scale