Deep Learning Kernel Software Performance Architect

NVIDIA · Semiconductors · Shanghai, China +1

This role focuses on optimizing GPU kernel performance for deep learning workloads by building automated, data-driven workflows to detect, explain, and prevent performance regressions. The engineer will be responsible for performance analysis, debugging, and developing Python-based automation for performance testing and analysis, collaborating with various engineering teams.

What you'd actually do

  1. Validate and analyze performance of GPU-accelerated kernels and key deep learning building blocks.
  2. Debug performance issues end-to-end: reproduce, isolate root causes, propose fixes or mitigation paths, and drive closure with the owning teams.
  3. Build performance narratives using structured evidence: baselines, controlled comparisons, and regression attribution.
  4. Develop and maintain Python-based automation for performance testing and analysis—using modern AI-assisted developer tools (e.g., Cursor/Claude Code/Copilot) to accelerate scripting while keeping code maintainable and reviewable.
  5. Design and operate performance test workflows: coverage definition, test/workload generation, automated large-scale execution (CI/nightly/on-demand), rerun rules, and reproducibility standards.
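As a rough illustration of the baseline-versus-candidate comparison behind regression attribution, here is a minimal Python sketch. It is not taken from NVIDIA's tooling. The kernel names, timing samples, and the 5% threshold are all hypothetical, and a real workflow would likely add rerun rules and noise modeling on top.

```python
from statistics import median

def detect_regressions(baseline, candidate, threshold=0.05):
    """Return {kernel: relative slowdown} for kernels whose median latency
    grew by more than `threshold` (a hypothetical 5% default)."""
    regressions = {}
    for kernel, base_samples in baseline.items():
        cand_samples = candidate.get(kernel)
        if not cand_samples:
            continue  # no candidate data for this kernel, nothing to compare
        base_med = median(base_samples)
        cand_med = median(cand_samples)
        slowdown = (cand_med - base_med) / base_med
        if slowdown > threshold:
            regressions[kernel] = slowdown
    return regressions

# Hypothetical latency samples in milliseconds (illustrative only).
baseline = {"gemm_fp16": [1.00, 1.02, 0.99], "attention_fwd": [2.00, 2.01, 1.98]}
candidate = {"gemm_fp16": [1.01, 1.00, 1.02], "attention_fwd": [2.30, 2.28, 2.31]}

report = detect_regressions(baseline, candidate)
for kernel, slowdown in report.items():
    print(f"{kernel}: {slowdown:.1%} slower than baseline")
```

Medians are used here rather than means so that a single noisy run is less likely to trigger a false alarm. That is one common choice, not the only reasonable one.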

Skills

Required

  • Master's or PhD degree, or equivalent experience, in Computer Science, Computer Engineering, Applied Math, or a related field
  • Strong programming ability in Python plus C/C++
  • Solid fundamentals in computer architecture and performance reasoning
  • Experience with performance analysis workflows: profiling, measurement methodology, reproducibility, and regression triage
  • Comfortable working across teams and driving issues to decision/closure with clear communication
  • Demonstrated strong C++ programming and software design skills, including debugging, performance analysis, and test design
  • Experience with performance-oriented parallel programming
  • Solid understanding of computer architecture
  • Ability to identify bottlenecks, optimize resource utilization, and improve throughput

Nice to have

  • Experience with high-performance kernels or math libraries (e.g., GEMM/attention, CUTLASS-like concepts)
  • Experience building CI/nightly regression systems, dashboards, or large-scale performance analytics
  • GPU programming/perf experience (CUDA or equivalent parallel programming)
  • Strong ML/DL workload understanding (training/inference shapes, precision modes, perf bottlenecks)
  • Familiarity with simulators/analytical modeling or performance characterization methodology

What the JD emphasized

  • performance analysis
  • debugging
  • performance regressions
  • Python-based automation
  • performance testing
  • performance analysis workflows
  • performance-oriented code reading/debugging
  • performance reasoning
  • performance-oriented parallel programming
  • performance characterization methodology