Senior System Software Engineer - DevOps and Infrastructure Automation

NVIDIA · Semiconductors · Santa Clara, CA +2 · Remote

Senior System Software Engineer on NVIDIA's AI Inference Operations team, focused on DevOps and infrastructure automation. The role involves designing, building, and operating the infrastructure backbone for AI inference products; managing Kubernetes deployments; architecting CI/CD pipelines; building observability; managing cloud and on-prem environments with infrastructure-as-code (IaC); owning the security posture; and collaborating with other engineering teams.

What you'd actually do

  1. Design, build, and operate the infrastructure backbone powering AI inference products, keeping it reliable, performant, and scalable at every layer.
  2. Own Kubernetes deployments end-to-end across cloud and on-prem: runbooks, canary checks, post-deploy validation, and rollbacks when needed.
  3. Architect CI/CD pipelines for automated build, test, packaging, and release of inference libraries and their container-based software stacks.
  4. Build observability that actually tells the truth about platform health — dashboards, logs, metrics, automated checks — and lead first-level incident triage with clean, actionable handoffs to engineering.
  5. Manage cloud and on-prem environments with infrastructure-as-code (Terraform, Ansible, Helm, Crossplane), and chip away at toil using GitHub Actions, GitLab CI, and custom tooling.

Skills

Required

  • BS/MS in CS/CE or equivalent experience
  • 7+ years operating production distributed systems (SRE / DevOps / Platform Ops)
  • Deep Kubernetes expertise
  • Strong CI/CD chops (GitLab CI, GitHub Actions)
  • Linux systems programming
  • Scripting in Python and Bash
  • IaC fluency (Terraform, Ansible, Helm, Crossplane)
  • Containerization depth (Docker, containerd, OCI)
  • Reliability ownership (SLOs/SLIs, on-call, incident response, post-incident reviews)
  • Observability stacks (Prometheus, Grafana, Loki)
  • Clear communicator

Nice to have

  • MLOps experience
  • Experience in open-source development workflows and community engagement on projects like Triton Inference Server or ONNX Runtime
  • Familiarity with GPU software stacks (CUDA, cuDNN, TensorRT, and inference serving frameworks)
  • Experience building custom test automation frameworks
  • Using data-driven metrics to improve platform health and developer efficiency
  • Demonstrated ability to debug complex issues spanning kernel modules, container runtimes, and distributed networking

What the JD emphasized

  • 7+ years operating production distributed systems (SRE / DevOps / Platform Ops)
  • Deep Kubernetes expertise
  • Strong CI/CD chops
  • IaC fluency
  • Proven reliability ownership