Senior DGX Cloud AI Infrastructure Software Engineer

NVIDIA · Semiconductors · Santa Clara, CA +4 · Remote

NVIDIA is seeking a Senior DGX Cloud AI Infrastructure Software Engineer to develop and optimize infrastructure software and tools for large-scale AI training, post-training, and inference. The role focuses on improving the efficiency and resiliency of AI workloads, co-designing APIs, and enhancing AI platforms. It requires strong debugging skills and distributed systems experience.

What you'd actually do

  1. Develop infrastructure software and tools for large-scale pre-training, post-training, and inference.
  2. Develop and optimize tools and libraries to improve infrastructure efficiency and resiliency.
  3. Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
  4. Enhance infrastructure and products underpinning NVIDIA's AI platforms.
  5. Define meaningful and actionable reliability metrics to track and improve system and service reliability.

Skills

Required

  • developing software infrastructure for large-scale AI systems
  • debugging skills
  • analyzing and triaging AI applications from the application level to the hardware level
  • observability platforms for monitoring and logging
  • building and scaling large-scale distributed systems
  • AI training and inferencing infrastructure services
  • Python
  • C/C++
  • scripting languages
  • quality software engineering practices
  • test development
  • defensive programming
  • version control
  • CI
  • communication and collaboration skills

Nice to have

  • working with large-scale clusters
  • defining and building an observability and telemetry software stack
  • RDMA software stack (NCCL, IB verbs, UCX, libfabric)
  • root cause analysis of failures at datacenter scale
  • DL framework internals (PyTorch, TensorFlow, JAX, and Ray)

What the JD emphasized

  • large scale AI systems
  • AI training and inferencing infrastructure services
  • large scale clusters
  • datacenter scale

Other signals

  • large-scale AI training and inferencing
  • infrastructure software and tools
  • efficiency and resiliency