Senior Software Engineer - Deep Learning Compiler CI Infrastructure

NVIDIA · Semiconductors · Santa Clara, CA +3

This role owns and evolves the CI/CD infrastructure for NVIDIA's deep learning compiler stacks. Responsibilities include designing and operating scalable CI systems for ML workloads, delivering performance signals, and applying AI/agent-based workflows to improve developer efficiency and failure triage.

What you'd actually do

  1. Build, maintain, and improve CI infrastructure that supports development, verification, and release of NVIDIA’s deep learning compiler stacks across GPU and accelerator environments
  2. Improve CI reliability and signal quality by reducing flakes, improving reproducibility, strengthening diagnostics, and making correctness and performance failures easier to understand and act on
  3. Apply automation, AI, and agent-based workflows to reduce manual CI operations, speed up failure triage, and improve developer efficiency
  4. Build reusable and self-service CI platforms that support multiple products, projects, model suites, hardware targets, and software configurations while partnering closely with compiler, infrastructure, and release teams

Skills

Required

  • 5+ years of experience designing, scaling, and operating CI/CD, build/release, or developer infrastructure for complex software systems
  • Proven experience building CI platforms end-to-end using systems such as GitLab CI, GitHub Actions, Jenkins, or similar tools, including pipeline orchestration, compute/runner management, artifact and package systems, and observability, with strong emphasis on reliability, reproducibility, and debuggability
  • Strong software engineering skills (Python required)
  • Ability to design, implement, and debug distributed systems end-to-end
  • Proven track record of designing, building, and deploying AI/LLM-based systems in real engineering workflows

Nice to have

  • Experience crafting and shipping sophisticated AI/agent-based systems that improve continuous integration or developer efficiency
  • Experience operating CI for DL/GPU software environments, including multi-GPU / multi-node workloads on Slurm, Kubernetes, or cloud platforms
  • Familiarity with compiler IRs and infrastructure such as LLVM/MLIR, XLA/HLO, Triton IR, cuTile, or TileIR

What the JD emphasized

  • Proven track record of designing, building, and deploying AI/LLM-based systems in real engineering workflows
  • Experience crafting and shipping sophisticated AI/agent-based systems that improve continuous integration or developer efficiency

Other signals

  • CI/CD infrastructure for deep learning compilers
  • Orchestrating ML workloads
  • AI/agent-based workflows for CI
  • Shipping AI/LLM-based systems in engineering workflows