NVIDIA is seeking Software Performance Architects to optimize GPU kernel performance for state-of-the-art data-center platforms. We build automated, data-driven workflows to detect, explain, and prevent performance regressions across key deep learning workloads, partnering closely with kernel developers, compiler teams, infrastructure, and architecture/performance groups.
What you'll be doing:
Performance analysis + debugging
- Validate and analyze performance of GPU-accelerated kernels and key deep learning building blocks.
- Debug performance issues end-to-end: reproduce, isolate root causes, propose fixes or mitigation paths, and drive closure with the owning teams.
- Build performance narratives using structured evidence: baselines, controlled comparisons, and regression attribution.
Automation + regression infrastructure (Python-heavy)
- Develop and maintain Python-based automation for performance testing and analysis, using modern AI-assisted developer tools (e.g., Cursor/Claude Code/Copilot) to accelerate scripting while keeping code maintainable and reviewable.
- Design and operate performance test workflows: coverage definition, test/workload generation, automated large-scale execution (CI/nightly/on-demand), rerun rules, and reproducibility standards.
- Convert raw run outputs into actionable insight: statistics, noise control, post-processing, visualization, and large-scale result mining.
Cross-team collaboration and operating model
- Work with kernel developers and compiler/rotation teams to ensure performance checks are practical, scalable, and aligned to release needs.
- Partner with SWQA and infrastructure teams for execution at scale and reliable pipelines/dashboards.
- Contribute to clear ownership/triage/routing rules so regressions close quickly and consistently.
- Follow general software engineering best practices, including support for regression testing and CI/CD flows.
What we need to see:
- Master's or PhD degree, or equivalent experience, in Computer Science, Computer Engineering, Applied Math, or a related field
- Strong programming ability in Python plus C/C++ (performance-oriented code reading/debugging)
- Solid fundamentals in computer architecture and performance reasoning (latency/throughput, memory hierarchy, parallelism).
- Experience with performance analysis workflows: profiling, measurement methodology, reproducibility, and regression triage.
- Comfortable working across teams and driving issues to decision/closure with clear communication
- Demonstrated strong C++ programming and software design skills, including debugging, performance analysis, and test design
- Experience with performance-oriented parallel programming, even if it’s not on GPUs (e.g., with OpenMP or pthreads)
- Solid understanding of computer architecture and some experience with assembly programming
- Ability to identify bottlenecks, optimize resource utilization, and improve throughput
Ways to stand out from the crowd:
- Experience with high-performance kernels or math libraries (e.g., GEMM/attention, CUTLASS-like concepts)
- Experience building CI/nightly regression systems, dashboards, or large-scale performance analytics
- GPU programming/perf experience (CUDA or equivalent parallel programming)
- Strong ML/DL workload understanding (training/inference shapes, precision modes, perf bottlenecks)
- Familiarity with simulators/analytical modeling or performance characterization methodology