Principal Software Engineer, GPU Compute

Roblox · Consumer · San Mateo, CA · Software Engineering

This is a Principal Software Engineer role on the Compute team responsible for Roblox's GPU and AI accelerator capabilities, focusing on the machine management layer and above. It involves owning GPU host provisioning, driver and firmware management, and GPU health, reliability, and performance across a large fleet of accelerators, as well as architecting how GPU capacity is exposed to compute platforms for AI workloads.

What you'd actually do

  1. Serve as the GPU technical leader for the Compute team, partnering across Kubernetes, Machine Bootstrap, Networking, and Cloud to drive GPU strategy end to end.
  2. Own the GPU host lifecycle above raw fleet management: driver, firmware, and CUDA stack management, GPU health and telemetry, and remediation of GPU-specific failures (XID errors, ECC, thermal, NVLink and fabric faults).
  3. Architect how GPU capacity is exposed to compute platforms, including scheduling, isolation, and integration with Kubernetes for GPU and AI workloads.
  4. Drive GPU reliability and performance at fleet scale, defining the detection, diagnosis, and automated repair of unhealthy accelerators before they impact production.
  5. Evaluate and onboard new GPU and AI accelerator platforms, networking topologies (NVLink, InfiniBand, RoCE), and multi-node training and inference patterns.

Skills

Required

  • GPU expertise
  • machine management layer
  • GPU host provisioning
  • driver and firmware lifecycle
  • GPU health and reliability
  • large-scale distributed systems and infrastructure
  • Go or other well-structured programming languages
  • CUDA
  • GPU scheduling
  • high-performance networking (NVLink, InfiniBand, RoCE)

Nice to have

  • Kubernetes for GPU workloads
  • bare-metal concepts (firmware, BMC/IPMI/Redfish, OS imaging)

What the JD emphasized

  • GPU expert role
  • own the hard problems that show up only at scale
  • set the technical direction for GPU compute
  • technical anchor
  • deep, hands-on GPU expertise
  • track record as an expert for compute
  • scars to prove you have scaled GPU or accelerator infrastructure
  • anchor expert that an organization relies on for its hardest GPU and compute problems