Senior Developer Technology Engineer - Windows AI Platform

NVIDIA · Semiconductors · Singapore, Singapore

Senior Developer Technology Engineer focused on optimizing AI GPU deployment on the NVIDIA RTX platform for enterprise and consumer AI applications. The role involves profiling and debugging GPU-accelerated AI applications, delivering hands-on trainings, and enhancing open-source LLM and GenAI software on Windows, in collaboration with internal teams and external partners, to improve performance and user experience.

What you'd actually do

  1. Work closely with internal engineering and product teams and external app developers on solving local end-to-end AI GPU deployment challenges on the NVIDIA RTX AI platform.
  2. Apply powerful profiling and debugging tools to analyze the most demanding GPU-accelerated end-to-end AI applications and detect insufficient GPU utilization that results in suboptimal runtime performance.
  3. Conduct hands-on trainings, develop sample code, and host presentations that provide guidance on efficient end-to-end AI deployment targeting optimal runtime performance on NVIDIA ARM-based SoCs.
  4. Improve the Windows LLM & GenAI user experience on NVIDIA RTX by working on feature and performance enhancements to open-source software, including but not limited to GGML, llama.cpp, Ollama, and ONNX Runtime.
  5. Collaborate with GPU driver and architecture teams, as well as NVIDIA Research, to influence next-generation GPU features by providing real-world workflows and feedback on partner and customer needs.

Skills

Required

  • 5+ years of professional experience in local GPU deployment, profiling and optimization
  • C/C++
  • Python
  • software design
  • programming techniques
  • Windows operating system development experience
  • open-source LLM and GenAI software experience
  • CUDA
  • NVIDIA's Nsight GPU profiling and debugging suite
  • problem-solving skills
  • independent and collaborative work
  • interpersonal and communication skills

Nice to have

  • GPU-accelerated AI inference driven by NVIDIA APIs, specifically cuDNN, CUTLASS, TensorRT
  • Vulkan and/or DX12
  • latest-generation GPU architectures
  • AI deployment on NPUs and ARM architectures

What the JD emphasized

  • local end-to-end AI GPU deployment challenges
  • suboptimal runtime performance
  • efficient end-to-end AI deployment
  • optimal runtime performance
  • LLM & GenAI user experience
  • performance enhancements
  • real-world workflows

Other signals

  • deployment challenges
  • GPU utilization
  • runtime performance
  • LLM & GenAI user experience
  • open-source software
  • partner and customer needs