Senior Software Engineer - Datacenter Systems

NVIDIA · Semiconductors · Santa Clara, CA +2 · Remote

Senior Software Engineer role focused on designing, building, and improving software systems for datacenter provisioning and management, including rack installation, networking, and cluster scaling. The role involves developing scalable release train architectures, defining and monitoring SLIs/SLOs/SLAs, building CI/CD pipelines, and automating software updates for high-performance GPU clusters running HPC and AI workloads. It requires strong programming skills in Python, Rust, and C++, plus experience with CI/CD tools and SRE practices.

What you'd actually do

  1. Develop and manage software for hands-off datacenter provisioning and lifecycle management, including rack installation, bare-metal networking configuration, and cluster scaling.
  2. Build and implement scalable release train architectures that modularize systems and enable independent, reliable release cycles.
  3. Define, monitor, and enforce Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs) for core infrastructure services to ensure high availability and reliability.
  4. Develop intuitive user interfaces (UIs) and APIs for internal provisioning and management tools, making cluster operations and visibility more straightforward.
  5. Lead the technical requirement definition process, clearly articulating requirements, inputs, outputs, and quantifiable outcomes for new infrastructure features and system improvements.

Skills

Required

  • Python
  • Rust
  • C++
  • Shell
  • CI/CD tools
  • Infrastructure-as-code frameworks
  • Linux
  • Networking
  • Distributed systems
  • Software programming
  • Systems management
  • Datacenter provisioning
  • Release train architectures
  • SRE practices

Nice to have

  • Jenkins
  • GitLab
  • Ansible
  • GitOps
  • Kubernetes
  • SLIs
  • SLOs
  • SLAs
  • Observability tools
  • Prometheus
  • Grafana
  • User-facing components
  • CLI
  • Cluster management tools
  • Slurm
  • NVIDIA DGX systems
  • GPU-based clusters

What the JD emphasized

  • 8+ years of experience managing infrastructure or systems in high-performance or distributed environments.
  • Expertise in software programming using Python, Rust, C++, and Shell or similar high-level languages.
  • Practical experience with modern CI/CD tools and infrastructure-as-code frameworks such as Jenkins, GitLab, Ansible, GitOps, and Kubernetes.
  • Strong understanding of Linux, networking, and distributed system building.
  • Ability to break down monolithic systems into scalable, loosely coupled components.