Software Engineer — GPU Networking & Distributed Systems

Baseten · Data AI · San Francisco, CA · EPD

We're hiring a Software Engineer focused on GPU Networking and Distributed Systems to optimize AI inference infrastructure, specifically for LLMs and multi-modal models. The role spans integrating RDMA, optimizing the networking layers behind disaggregated KV cache and WideEP, enabling fast startup, and building observability tools for bleeding-edge hardware.

What you'd actually do

  1. Make RDMA First-Class: You will work on integrating RDMA/RoCE/InfiniBand capabilities directly into our inference stack, helping us move beyond TCP/IP to unlock order-of-magnitude improvements in bandwidth and latency.
  2. Optimize Distributed Inference: You will implement and tune the networking layers necessary for efficient Disaggregated KV Cache Offload and WideEP, ensuring seamless communication across NVLink and InfiniBand for our MoE models.
  3. Enable Serverless-Grade Startup Speeds for LLMs: You will work deeply with checkpointing and storage mechanisms to enable sub-10-second startup for trillion-parameter models.
  4. Deep-Dive into Hardware: You will characterize and validate networking performance on bleeding-edge clusters (H100/H200, B200/B300, GB200/GB300 NVL72), writing the acceptance tests that ensure our hardware delivers peak achievable throughput and minimal latency.
  5. Build Observability: You will design the tools that let us visualize packet flow, congestion, and effective bandwidth across the GPU interconnects, helping us diagnose complex distributed system behaviors.
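To give a feel for the acceptance-test and observability work above, here is a minimal sketch of the standard effective-bandwidth accounting for an all-reduce (the convention used by nccl-tests). The function names and example numbers are illustrative, not Baseten's actual tooling or targets.

```python
# Sketch: effective bandwidth of an all-reduce, using the standard
# ring-allreduce accounting. Function names are illustrative.

def algo_bandwidth_gbps(bytes_moved: float, seconds: float) -> float:
    """Algorithm bandwidth: payload size / wall-clock time, in GB/s."""
    return bytes_moved / seconds / 1e9

def bus_bandwidth_gbps(bytes_moved: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth: algorithm bandwidth scaled by 2*(n-1)/n, the
    per-link traffic factor of a ring all-reduce."""
    return algo_bandwidth_gbps(bytes_moved, seconds) * 2 * (n_ranks - 1) / n_ranks

# Example: a 1 GiB all-reduce across 8 GPUs finishing in 6 ms.
size = 1 << 30          # 1 GiB payload
t = 6e-3                # 6 ms wall-clock time
algbw = algo_bandwidth_gbps(size, t)      # ~179 GB/s
busbw = bus_bandwidth_gbps(size, t, 8)    # ~313 GB/s
```

Comparing measured bus bandwidth against the link's line rate is the usual way an acceptance test decides whether an interconnect is delivering what the hardware promises.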

Skills

Required

  • Experience with high-performance networking protocols (InfiniBand, RoCE v2)
  • Proficiency in C++ or Python
  • Understanding of the memory hierarchy in modern NVIDIA architectures (H100/Blackwell)
  • Experience debugging distributed systems

Nice to have

  • NCCL
  • NVSHMEM
  • UCX
  • GPUDirect Storage (GDS)
  • Weka
  • 3FS
  • TensorRT-LLM
  • vLLM
  • SGLang
  • Writing low-level benchmarks
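As a back-of-envelope check on the sub-10-second startup goal mentioned above: the parameter counts, dtypes, and node counts below are illustrative assumptions, not figures from the posting.

```python
# Back-of-envelope sketch: aggregate read bandwidth needed to stream a
# checkpoint within a startup budget (ignores deserialization overhead).

def required_read_bandwidth_gbps(n_params: float, bytes_per_param: float,
                                 budget_s: float) -> float:
    """GB/s of aggregate storage/network bandwidth needed to load all
    weights inside the startup budget."""
    return n_params * bytes_per_param / budget_s / 1e9

# A 1-trillion-parameter model in FP8 (1 byte/param) with a 10 s budget:
bw = required_read_bandwidth_gbps(1e12, 1.0, 10.0)   # 100.0 GB/s aggregate
# Spread across, say, 8 nodes that is ~12.5 GB/s per node -- roughly the
# regime where GPUDirect Storage and parallel filesystems such as Weka
# become relevant.
```

This is why the nice-to-have list leans on GDS and high-throughput storage: a single NVMe drive or a TCP object-store path cannot sustain that rate on its own.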

What the JD emphasized

  • Make RDMA First-Class
  • Optimize Distributed Inference
  • Enable Serverless-Grade Startup Speeds for LLMs
  • Deep-Dive into Hardware
  • Build Observability
  • high-performance networking protocols
  • InfiniBand
  • RoCE v2
  • memory hierarchy
  • NVIDIA architectures
  • H100
  • Blackwell
  • B200
  • B300
  • Rubin
  • NVL72
  • GB300

Other signals

  • GPU Networking
  • Distributed Inference
  • RDMA
  • LLM Optimization