Senior Systems Software Engineer, Kubernetes Node Lifecycle - DGX Cloud

NVIDIA · Semiconductors · Santa Clara, CA +1

Senior Systems Software Engineer role focused on Kubernetes node lifecycle management for NVIDIA's DGX Cloud, which provides accelerated computing solutions for AI workloads. Responsibilities include building and refining CAPI providers, managing OS image pipelines, ensuring node security and compliance, and handling nodepool lifecycle at scale. Requires deep expertise in Kubernetes, CAPI, OS image systems, and cloud infrastructure.

What you'd actually do

  1. Lead the development and refinement of Cluster API (CAPI) providers for NVIDIA Kubernetes Engine (NKE), ensuring reliable, consistent, and scalable node provisioning across DGX Cloud and NCP environments.
  2. Develop and maintain bring-your-own-node workflows that let customers integrate diverse NVIDIA hardware into NKE clusters while keeping operations consistent.
  3. Coordinate OS image generation, packaging, deployment, and update processes for NKE nodes. Ensure images are optimized for NVIDIA GPU workloads and meet enterprise- and cloud-grade security and compliance requirements.
  4. Develop and maintain node image hardening pipelines, incorporating CIS benchmarks, automated CVE remediation, and promotion gates tied to security posture.
  5. Develop and maintain automated test suites that verify node image correctness across Kubernetes versions and NVIDIA hardware configurations before production deployment, with continuous validation through CI/CD pipelines.

Skills

Required

  • 8 years of experience in systems software, cloud infrastructure, or Kubernetes node engineering.
  • Bachelor’s or Master’s degree in Engineering (Electrical, Computer Engineering, Computer Science) or equivalent experience.
  • Deep expertise in Cluster API (CAPI), including provider development and full machine lifecycle from provisioning to deletion.
  • Extensive experience with OS image build pipelines, node image packaging, and delivery systems for Kubernetes nodes (for example, image-builder, containerd, cloud-init, Packer).
  • Practical experience with bring-your-own-node models and integrating diverse hardware into live Kubernetes environments, including large-scale nodepool lifecycle management and upgrades.
  • Strong understanding of kubelet configuration, node bootstrap, and the Kubernetes node registration lifecycle.
  • Experience with node image security, including vulnerability scanning, patch automation, and compliance gating as part of image build pipelines.
  • Proficiency in Go and/or Python, and hands-on experience with at least one major public cloud provider (GCP, AWS, Azure, OCI, or equivalent).

Nice to have

  • Direct experience building or maintaining node image pipelines for a hyperscaler Kubernetes distribution (GKE, EKS, AKS, OKE, or equivalent).
  • Experience with supply chain security and hardening for node images, including image signing, provenance attestation, SBOM generation, CIS benchmark compliance, and automated CVE remediation.
  • Experience with automated node provisioning and right-sizing at scale (for example, Karpenter, GKE NAP, or similar), and an understanding of how these interact with GPU workload scheduling.
  • Strong operational experience working with immutable OS image distributions (such as Flatcar, Bottlerocket, Azure Linux) and debugging node-layer failures in large Kubernetes clusters.
  • Proven track record of upstream contributions to Cluster API, Kubernetes, or related CNCF projects, combined with excellent communication and interpersonal skills.

What the JD emphasized

  • deep hyperscaler-level knowledge across the entire node lifecycle
  • technical depth needed to maintain cluster reliability at frontier AI scale
  • Deep expertise in Cluster API (CAPI), including provider development and full machine lifecycle from provisioning to deletion.
  • Extensive experience with OS image build pipelines, node image packaging, and delivery systems for Kubernetes nodes
  • Practical experience with bring-your-own-node models and integrating diverse hardware into live Kubernetes environments, including large-scale nodepool lifecycle management and upgrades.
  • Strong understanding of kubelet configuration, node bootstrap, and the Kubernetes node registration lifecycle.
  • Experience with node image security, including vulnerability scanning, patch automation, and compliance gating as part of image build pipelines.