Senior Systems Engineer, Storage - DGX Cloud

NVIDIA · Semiconductors · CA +4 · Remote

Senior Systems Engineer role focused on building, automating, and operating large-scale production systems, specifically storage and data platforms on Kubernetes for NVIDIA's GPU cloud services. Responsibilities include designing, deploying, and operating Kubernetes solutions, building automation tools for storage lifecycle management, developing telemetry and observability, and applying analytical troubleshooting skills. Requires strong experience in Kubernetes, software design, observability tools, and programming languages like Python or Go.

What you'd actually do

  1. Design, deploy, and operate solutions on Kubernetes for large-scale storage and data platforms, including the manifests, Helm charts, and operators that run them.
  2. Build tools, services, and automation that improve the lifecycle of storage and data systems – from provisioning and configuration through deployment, scaling, and day-2 operations.
  3. Develop and operate telemetry and observability for production systems – metrics, logging, tracing, dashboards, and alerting – so that system health, availability, and latency are measurable and actionable.
  4. Apply strong analytical troubleshooting skills to diagnose and resolve complex issues across distributed, containerized infrastructure.
  5. Work closely with peers and partner teams to improve the lifecycle of services, from inception and design through deployment, operation, and refinement.

Skills

Required

  • BS degree in Computer Science or related technical field involving coding (or equivalent experience)
  • 12+ years of practical experience
  • Hands-on experience with Kubernetes – deploying, configuring, and operating workloads and solutions on Kubernetes in production
  • Experience building tools and services for storage, data, or platform infrastructure
  • Solid software design fundamentals (algorithms, data structures, complexity analysis) on large-scale Linux-based systems
  • Experience building and operating telemetry and observability using tools such as Prometheus, InfluxDB, Grafana, and the Elastic stack
  • Strong analytical troubleshooting skills with a systematic, root-cause-driven approach to identifying and resolving complex problems
  • Proficiency in one or more of the following: Python, Go, or Java
  • Good knowledge of infrastructure configuration management and infrastructure-as-code tools such as Ansible, Chef, Puppet, ArgoCD, Git Pipelines, and Terraform

Nice to have

  • Customer-first mindset with a focus on customer satisfaction and a passion for ensuring customer success
  • Experience with Git, code review, pipelines, and CI/CD
  • Experience using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker
  • Interest in crafting, analyzing, and fixing large-scale distributed systems, with strong debugging skills and a systematic problem-solving approach
  • Experience designing storage- or data-focused tooling and automating its operation at scale
  • Ability to thrive in collaborative environments, work well with various teams, and adapt to different working styles

What the JD emphasized

  • large-scale storage and data platforms
  • Kubernetes
  • automation
  • telemetry and observability
  • analytical troubleshooting skills
  • distributed, containerized infrastructure
  • provisioning and configuration through deployment, scaling, and day-2 operations
  • system health, availability, and latency are measurable and actionable
  • complex issues
  • large-scale Linux-based systems
  • Prometheus, InfluxDB, Grafana, and the Elastic stack
  • systematic, root-cause-driven approach
  • large-scale distributed systems