DGX Cloud Performance Engineer

NVIDIA · Semiconductors · Bangalore, India +2

NVIDIA is seeking Parallel and Distributed Systems engineers to drive performance analysis, optimization, and modeling for its DGX Cloud AI platform. The role involves developing benchmarks, analyzing performance bottlenecks, and collaborating with AI researchers to improve system performance and usability. Expertise in large-scale parallel systems, AI workloads, performance modeling, and AI frameworks is required.

What you'd actually do

  1. Develop benchmarks and end-to-end customer applications running at scale, instrumented for performance measurement, tracking, and sampling, to measure and optimize the performance of important applications and services.
  2. Construct carefully designed experiments to analyze, study, and develop critical insights into performance bottlenecks and dependencies from an end-to-end perspective.
  3. Develop ideas for improving end-to-end system performance and usability by driving changes in hardware, software, or both.
  4. Collaborate with AI researchers, developers, and application service providers to understand internal developer and external customer pain points and requirements, project future needs, and share best practices.
  5. Develop the necessary modeling framework and TCO (total cost of ownership) analysis to enable efficient exploration and sweeps of the architecture and design space.

Skills

Required

  • Expertise in working with large-scale parallel and distributed accelerator-based systems
  • Expertise in optimizing the performance of AI workloads on large-scale systems
  • Experience with performance modeling and benchmarking at scale
  • Strong background in computer architecture, networking, storage systems, and accelerators
  • Familiarity with popular AI frameworks (PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, vLLM)
  • Experience with AI/ML models and workloads, in particular LLMs, and an understanding of DNNs and their use in emerging AI/ML applications and services
  • Bachelor's or Master's degree in Engineering or equivalent experience (preferably Electrical Engineering, Computer Engineering, or Computer Science)
  • 10 years of experience in the above areas
  • Proficiency in Python and C/C++
  • Expertise with at least one public CSP's infrastructure (GCP, AWS, Azure, OCI, …)

Nice to have

  • PhD in a relevant area
  • Proficiency in CUDA, XLA

What the JD emphasized

  • performance analysis
  • optimization
  • AI workloads
  • large scale systems
  • performance modeling
  • AI frameworks
  • AI/ML models
  • LLMs
  • DNNs

Other signals

  • AI infrastructure