Senior Software Engineer, Cloud-native Stack – CSP Engagements

NVIDIA · Semiconductors · Santa Clara, CA +3

Senior Software Engineer role focused on the cloud-native stack for AI/ML datacenters, involving deep-dive debugging of multi-rack, multi-tenant clusters, prototyping feature extensions for Kubernetes and Slurm, and collaborating with customers and internal teams. Requires strong expertise in Kubernetes and Slurm internals, experience with next-gen GPUs, and a background in customer-facing engineering.

What you'd actually do

  1. Perform deep-dive debugging of multi-rack, multi-tenant clusters: scheduler behavior, container runtime issues, device-plugin crashes, RDMA/IB fabric anomalies, etc.
  2. Gather customer requirements and prototype feature extensions for Kubernetes operators, Slurm plugins, and custom micro-services that expose new GPU capabilities.
  3. Drive joint architecture reviews and “whiteboard” sessions with CSP and internal platform teams; convert findings into RFCs and upstream pull requests.
  4. Create reproducible testbeds (Helm/Ansible/Terraform) that mirror customer environments; automate validation and benchmark suites.
  5. Deliver technical collateral (design docs, how-to guides, demo scripts) and present at customer on-sites, KubeCon, and SlurmUG.

Skills

Required

  • Kubernetes internals (scheduler, CRI/CNI/CSI, operators)
  • Slurm (federation, power-save, plugins)
  • integrating next-gen GPUs (Blackwell/GB200/GB300)
  • debugging large-scale, cloud-native stacks across networking (RDMA/RoCE), storage, and control planes
  • customer-facing engineering or solutions-architect background: requirements gathering, PoC ownership, roadmap influence
  • CI/CD (GitHub Actions, Tekton)
  • observability (Prometheus, OpenTelemetry)
  • infrastructure-as-code
  • distributed systems (Go, Rust, C/C++ or Python for tooling)
  • BS or MS (or equivalent experience) in Computer Engineering, Computer Science, or related field

Nice to have

  • Upstream contributions to Kubernetes, Slurm, Volcano, or similar projects
  • GPU computing (CUDA)
  • deep learning workloads

What the JD emphasized

  • multi-rack, multi-tenant AI datacenters
  • Kubernetes + Slurm issues
  • complex scheduling challenges
  • Kubernetes internals (scheduler, CRI/CNI/CSI, operators)
  • Slurm (federation, power-save, plugins)
  • integrating next-gen GPUs (Blackwell/GB200/GB300)
  • large-scale, cloud-native stacks