Senior Software Engineer - Deep Learning Compiler Verification and Infrastructure

NVIDIA · Semiconductors · Santa Clara, CA +5 · Remote

Senior Software Engineer focused on building and scaling CI/CD infrastructure and automation for deep learning compilers at NVIDIA. The role involves improving reliability, performance, and observability of ML workloads across diverse GPU environments, with an emphasis on practical AI applications to enhance CI processes.

What you'd actually do

  1. Drive CI and infrastructure capabilities that make deep learning compiler development fast, reliable, and scalable. This includes improving signal-to-noise (flake reduction, reproducibility, and richer diagnostics), accelerating iteration cycles, scaling capacity and coverage across models/hardware/software configurations, and building strong observability (metrics, logging, tracing, dashboards) so failures are easy to understand and fix.
  2. Explore practical uses of AI to enhance CI workflows—such as smarter test selection, automated triage/summarization, and faster issue isolation—ultimately increasing the quality and speed of deep learning compiler development, testing, and release.

Skills

Required

  • Python
  • CI/CD
  • MLOps
  • Observability
  • Deep learning frameworks
  • Linux

Nice to have

  • AI/LLMs for CI
  • Agent-based workflows
  • Compiler verification techniques
  • LLVM/MLIR

What the JD emphasized

  • 3+ years of professional experience designing and scaling CI/CD, build/release, or developer productivity infrastructure for DL/GPU software environments
  • Hands-on experience building CI/MLOps platform capabilities—pipeline orchestration, artifact/package management, and production-grade observability (logs/metrics/dashboards)—with strong reliability and maintainability
  • Experience with deep learning frameworks/runtime stacks (e.g., PyTorch, JAX, vLLM, SGLang, TensorRT, NeMo) and running real workloads in production-like environments

Other signals

  • CI/CD systems for ML workloads
  • Deep learning compiler development
  • Observability for ML systems
  • AI to enhance CI workflows