Senior AI Infrastructure Software Engineer - DGX Cloud

NVIDIA · Semiconductors · Santa Clara, CA +3 · Remote

NVIDIA is seeking a Senior AI Infrastructure Software Engineer to design, build, and maintain AI platforms for large-scale AI training, inferencing, fine-tuning, and Agentic AI in production. The role involves developing platforms and tools that improve AI/ML workload efficiency, resiliency, and observability, with a focus on distributed systems and Kubernetes.

What you'd actually do

  1. Develop platforms and tools for large-scale AI, LLM, and GenAI infrastructure.
  2. Develop and optimize tools to improve AI/ML workload efficiency and resiliency.
  3. Root-cause, analyze, and triage failures from the application level to the hardware level.
  4. Enhance infrastructure and products underpinning NVIDIA's AI platforms.
  5. Co-design and implement APIs for integration with NVIDIA's resiliency stacks on the platform.

Skills

Required

  • Python
  • C/C++
  • scripting languages
  • Kubernetes
  • observability platforms
  • monitoring
  • logging
  • distributed systems
  • debugging
  • root cause analysis
  • AI training
  • AI inferencing
  • data infrastructure services

Nice to have

  • large-scale AI clusters
  • cloud-native infrastructure
  • NVIDIA GPUs
  • network technologies (RDMA, IB, NCCL)
  • DL frameworks (PyTorch, TensorFlow, JAX, Dynamo, Ray)
  • datacenter-scale failure analysis

What the JD emphasized

  • At least 8 years of experience developing software infrastructure for large-scale AI systems.
  • Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.
  • Proven track record in building and scaling large-scale distributed systems.
  • Experience with AI training and inferencing and data infrastructure services.

Other signals

  • AI infrastructure
  • large-scale AI training
  • inferencing
  • fine-tuning
  • Agentic AI
  • production
  • distributed systems
  • Kubernetes
  • observability