Senior Deep Learning Systems Engineer, Datacenters

NVIDIA · Semiconductors · Santa Clara, CA +1

Senior Deep Learning Systems Engineer focused on analyzing and optimizing the performance and power consumption of deep learning applications on datacenter hardware, influencing the design of future AI systems and software stacks. This role involves developing software infrastructure, analysis tools, and profiling methodologies for DL workloads, with a strong emphasis on system architecture and performance analysis.

What you'd actually do

  1. Help develop software infrastructure to characterize and analyze a broad range of Deep Learning applications.
  2. Evolve cost-efficient datacenter architectures tailored to meet the needs of Large Language Models (LLMs).
  3. Work with experts to help develop analysis and profiling tools in Python, Bash, and C++ to measure key performance metrics of DL workloads running on NVIDIA systems.
  4. Analyze system and software characteristics of DL applications.
  5. Develop analysis tools and methodologies to measure key performance metrics and to estimate potential for efficiency improvement.

Skills

Required

  • Bachelor’s degree in Electrical Engineering or Computer Science or equivalent experience
  • 8 years or more of relevant experience
  • System Software: Operating Systems (Linux), Compilers, GPU kernels (CUDA), DL Frameworks (PyTorch, TensorFlow)
  • Silicon Architecture and Performance Modeling/Analysis: CPU, GPU, Memory or Network Architecture
  • Programming in C/C++ and Python
  • Deep understanding of computer system architecture and performance analysis
  • Demonstrated hands-on experience in the domains listed above

Nice to have

  • Master’s or PhD degree
  • Exposure to containerization platforms (Docker)
  • Datacenter workload managers (Slurm)
  • Virtual environments
  • Silicon performance monitoring or profiling tools (e.g. perf, gprof, nvidia-smi, DCGM)
  • Performance modeling experience in any one of CPU, GPU, Memory or Network Architecture
  • Experience working with multi-site or multi-functional teams

What the JD emphasized

  • Deep understanding of computer system architecture and performance analysis is essential for success in this role
  • Demonstrated hands-on experience in these domains

Other signals

  • Analyze performance and power consumption of deep learning applications on datacenter-class hardware
  • Develop analysis tools and methodologies to measure key performance metrics
  • Optimize next-generation systems and the Deep Learning software stack