GPU Performance Engineer - Neural Reconstruction

NVIDIA · Semiconductors · Canada · Remote

GPU Performance Engineer role focused on optimizing neural reconstruction and Gaussian Splatting workloads, using PyTorch, CUDA, and GPU profiling to improve training and rendering performance.

What you'd actually do

  1. Profile end-to-end neural reconstruction workflows and identify bottlenecks across data loading, initialization, training, rendering, evaluation, and export.
  2. Improve CUDA and PyTorch performance for Gaussian Splatting and neural reconstruction workloads, including camera/lidar data, multiview batching, large-scene rendering, and memory-sensitive training paths.
  3. Analyze GPU performance using tools such as Nsight Systems, Nsight Compute, NVTX, PyTorch Profiler, CUDA events, and benchmark dashboards.
  4. Optimize sparse and irregular rendering workloads, including tile-level masking/culling, sparse gradients, batching, and multi-GPU execution.
  5. Translate high-impact Python, NumPy, or PyTorch bottlenecks into efficient CUDA/C++ or PyTorch-native implementations when appropriate.

Skills

Required

  • Python
  • C++
  • PyTorch
  • CUDA
  • GPU profiling
  • Performance analysis
  • Benchmarking
  • Validation of optimizations

Nice to have

  • Gaussian Splatting
  • NeRF
  • Differentiable rendering
  • Rasterization
  • Neural rendering
  • SLAM
  • 3D reconstruction
  • Robotics perception
  • Autonomous-vehicle perception
  • Deep CUDA performance
  • Sparse tensors
  • Distributed training
  • Distributed rendering
  • Camera and lidar geometry
  • Projection models
  • Calibration
  • Rolling shutter
  • Depth rendering
  • Multi-sensor reconstruction
  • Large production ML systems

What the JD emphasized

  • BS, MS, PhD, or equivalent experience in Computer Science, Computer Engineering, Electrical Engineering, Applied Math, Robotics, Computer Vision, Machine Learning, or a related field, plus 12+ years of experience.
  • Strong programming skills in Python and C++.
  • Hands-on experience with PyTorch or a similar tensor/autograd framework.
  • Experience optimizing GPU-accelerated workloads using CUDA, C++/CUDA extensions, or related GPU programming approaches.
  • Practical experience with profiling and performance analysis, including root-causing CPU/GPU bottlenecks, synchronization overhead, memory pressure, kernel launch overhead, and framework-level inefficiencies.
  • Ability to develop benchmarks and validate that optimizations preserve correctness, numerical behavior, and user-visible quality.
  • Strong communication skills, including the ability to explain performance tradeoffs, risks, and results to research and engineering partners.

Other signals

  • GPU performance optimization
  • CUDA and PyTorch performance
  • Neural reconstruction and Gaussian Splatting workloads
  • Optimize training and rendering workflows