Senior Developer Technology Engineer

NVIDIA NVIDIA · Semiconductors · Beijing, China +1

This role focuses on optimizing GPU-accelerated code for training and inference performance of large-scale recommender systems. It involves designing and implementing high-performance C++/CUDA components, developing tests, and optimizing data flows between GPUs, NICs, and SSDs. The ideal candidate has experience with C++, CUDA, Python, GPU performance profiling, and ideally, building or optimizing recommender systems or production ML workloads on GPUs.

What you'd actually do

  1. Profile, analyze, and optimize GPU‑accelerated code to improve training and inference performance for large‑scale recommender systems.
  2. Design, implement, and maintain high‑performance C++/CUDA components within our core recommendation framework.
  3. Develop and execute tests (unit, integration, and performance) to ensure numerical correctness, stability, and regression prevention in GPU workloads.
  4. Collaborate closely with CUDA and ML engineers to interpret profiling results, refine designs, and implement optimization strategies.
  5. Design and optimize high‑throughput data flows between GPUs, RDMA‑capable NICs, and NVMe SSDs using technologies such as GPUDirect RDMA and GPUDirect Storage.

Skills

Required

  • C++
  • CUDA
  • Python
  • Linux
  • GPU performance profiling
  • computational pipeline optimization

Nice to have

  • recommender systems
  • production ML workloads on GPUs
  • deep learning frameworks (PyTorch, TensorFlow, JAX)
  • distributed or multi-GPU training
  • NCCL
  • MPI
  • RDMA
  • high-speed data movement
  • high-performance storage pipelines
  • GPUDirect Storage
  • NVMe-oF

What the JD emphasized

  • 3+ years of experience in C++, CUDA, and Python development on Linux systems.
  • Proven ability to diagnose and optimize computational pipelines using profiling tools such as Nsight Systems or nvprof.
  • Relevant experience building or optimizing large‑scale recommender systems or production ML workloads on GPUs.

Other signals

  • GPU-accelerated recommendation tools
  • training and inference performance
  • large-scale recommender systems
  • high-performance C++/CUDA components
  • high-throughput data flows