Senior Software Architect, AI Systems and Networking

NVIDIA · Semiconductors · Santa Clara, CA +2 · Remote

This role focuses on building and optimizing systems-level software: the high-performance communication and memory management libraries that distributed AI workloads depend on. It spans hardware-software co-optimization, profiling data movement, and integrating networking capabilities into AI serving stacks, and it bridges applied research and production engineering.

What you'd actually do

  1. Architecting and implementing high-performance communication and memory management libraries for distributed AI
  2. Driving hardware-software co-optimization with GPU, DPU, NIC, and switch teams through GPUDirect RDMA, NVLink, and next-generation interconnects
  3. Profiling and optimizing data movement across GPU memory, system DRAM, NVMe, and network fabrics
  4. Integrating networking capabilities into AI serving stacks such as vLLM, SGLang, and TensorRT-LLM
  5. Contributing to and maintaining open-source projects, mentoring engineers, conducting design reviews, and prototyping experimental technologies to evaluate their viability

Skills

Required

  • systems software
  • networking
  • high-performance networking
  • InfiniBand
  • RoCE
  • RDMA
  • NVLink
  • GPUDirect
  • C/C++/Rust systems programming
  • performance profiling
  • low-level debugging
  • ML systems concepts
  • transformer architectures
  • KV cache mechanics
  • model parallelism
  • distributed training
  • inference patterns

Nice to have

  • ML inference frameworks
  • vLLM
  • SGLang
  • TensorRT-LLM
  • storage networking
  • NVMe-oF
  • GPUDirect Storage
  • S3
  • Reinforcement Learning systems

What the JD emphasized

  • shipping production code
  • complex projects
  • performance profiling
  • low-level debugging
  • ML inference frameworks

Other signals

  • systems-level software
  • low-level transport optimization
  • hardware-software co-design
  • communication frameworks
  • distributed AI
  • ML systems concepts