DL System Software Engineer - AI Platform

NVIDIA · Semiconductors · Toronto, ON

NVIDIA is seeking a DL System Software Engineer to join its AI Platform team. The role involves designing and building solutions for scheduling large-scale AI training and inference workloads on GPU clusters, and optimizing performance and efficiency for large models. The engineer will work on core infrastructure, resource management, and GPU scheduling, contributing to NVIDIA's AI platform.

What you'd actually do

  1. Taking part in the development of NVIDIA's AI platform for training, fine-tuning, and serving the latest and greatest AI models with the best performance and efficiency.
  2. Designing and building solutions for scheduling large-scale AI training and inference workloads on GPU clusters across many cloud infrastructures.
  3. Exploring and finding solutions for open problems like industry-scale resource management, GPU scheduling, performance prediction, and live workload migration.
  4. Working with and contributing to adjacent teams such as TensorRT/Dynamo inference engine, ML compiler, KAI/Grove scheduler, Lepton cloud, etc.

Skills

Required

  • Bachelor's degree or equivalent experience in Computer Science, Computer Engineering, or a relevant technical field
  • 5+ years of experience
  • Experience building large-scale systems from scratch
  • Strong coding skills in programming languages like Python, Go, Rust, and/or C/C++
  • Solid foundation in other computer science and computer engineering topics: algorithms and data structures, operating systems, computer architecture, etc.

Nice to have

  • Prior experience with container-based deployment systems like Kubernetes is beneficial
  • Strong understanding of AI and related technologies is a huge plus
  • Graduate-level education or relevant practical background, particularly in research, is beneficial
  • Practical experience in building and optimizing AI applications is highly desired
  • Proficiency in container software such as containerd, CRI-O, Linux namespaces, and CRIU, and in NVIDIA GPU technology such as CUDA graphs and the driver/runtime, is greatly advantageous

What the JD emphasized

  • 5+ years of experience
  • Experience building large-scale systems from scratch

Other signals

  • building a unified solution that brings NVIDIA technologies into a single, cohesive platform
  • designing and building solutions for scheduling large-scale AI training and inference workloads on GPU clusters
  • exploring and finding solutions for open problems like industry-scale resource management, GPU scheduling, performance prediction, and live workload migration