What you'd actually do

Serve as the GPU technical leader for the Compute team, partnering across Kubernetes, Machine Bootstrap, Networking, and Cloud to drive GPU strategy end to end.

Own the GPU host lifecycle above raw fleet management: driver, firmware, and CUDA stack management, GPU health and telemetry, and remediation of GPU-specific failures (XID errors, ECC, thermal, NVLink and fabric faults).

Architect how GPU capacity is exposed to compute platforms, including scheduling, isolation, and integration with Kubernetes for GPU and AI workloads.

Drive GPU reliability and performance at fleet scale, defining the detection, diagnosis, and automated repair of unhealthy accelerators before they impact production.

Evaluate and onboard new GPU and AI accelerator platforms, networking topologies (NVLink, InfiniBand, RoCE), and multi-node training and inference patterns.

Skills

Required

GPU expertise
machine management layer
GPU host provisioning
driver and firmware lifecycle
GPU health and reliability
large-scale distributed systems and infrastructure
Go or other well-structured programming languages
CUDA
GPU scheduling
high-performance networking (NVLink, InfiniBand, RoCE)

Nice to have

Kubernetes for GPU workloads
bare-metal concepts (firmware, BMC/IPMI/Redfish, OS imaging)

What the JD emphasized

GPU expert role

own the hard problems that show up only at scale

set the technical direction for GPU compute

technical anchor

deep, hands-on GPU expertise

track record as an expert for compute

scars to prove you have scaled GPU or accelerator infrastructure

anchor expert that an organization relies on for its hardest GPU and compute problems

Every day, tens of millions of people come to Roblox to explore, create, play, learn, and connect with friends in 3D immersive digital experiences, all created by our global community of developers and creators.

At Roblox, we’re building the tools and platform that empower our community to bring any experience that they can imagine to life. Our vision is to reimagine the way people come together, from anywhere in the world, and on any device. We’re on a mission to connect a billion people with optimism and civility, and we’re looking for amazing talent to help us get there.

A career at Roblox means you’ll be working to shape the future of human interaction, solving unique technical challenges at scale, and helping to create safer, more civil shared experiences for everyone.

As a Principal Software Engineer on the Compute team, you will be the technical anchor for Roblox's GPU and AI accelerator capabilities. This is a battle-tested GPU expert role focused on the machine management layer and above: how GPU hosts are made production-ready, kept healthy, and turned into reliable compute for the workloads that depend on them. You will own the hard problems that show up only at scale, from driver and firmware management to GPU health, reliability, and performance across a rapidly growing fleet of accelerators spanning Roblox data centers and cloud environments. You will set the technical direction for GPU compute and up-level the entire organization's GPU expertise.

You will:

Serve as the GPU technical leader for the Compute team, partnering across Kubernetes, Machine Bootstrap, Networking, and Cloud to drive GPU strategy end to end.
Own the GPU host lifecycle above raw fleet management: driver, firmware, and CUDA stack management, GPU health and telemetry, and remediation of GPU-specific failures (XID errors, ECC, thermal, NVLink and fabric faults).
Architect how GPU capacity is exposed to compute platforms, including scheduling, isolation, and integration with Kubernetes for GPU and AI workloads.
Drive GPU reliability and performance at fleet scale, defining the detection, diagnosis, and automated repair of unhealthy accelerators before they impact production.
Evaluate and onboard new GPU and AI accelerator platforms, networking topologies (NVLink, InfiniBand, RoCE), and multi-node training and inference patterns.
Establish the standards, tooling, and APIs that let other engineering teams consume GPU compute safely and efficiently, reducing toil and raising the bar for the org.

You have:

10+ years of experience building and operating large-scale distributed systems and infrastructure.
Deep, hands-on GPU expertise at the machine management layer and above: GPU host provisioning, driver and firmware lifecycle, GPU health and reliability, and the realities of running accelerators in production.
A track record as an expert for compute, not just fleet management, with the scars to prove you have scaled GPU or accelerator infrastructure that other teams depend on.
Strong proficiency in Go or other well-structured programming languages.
Experience operating GPU and AI workloads in production, including familiarity with CUDA, GPU scheduling, and high-performance networking (NVLink, InfiniBand, RoCE).
Familiarity with Kubernetes for GPU workloads and with bare-metal concepts (firmware, BMC/IPMI/Redfish, OS imaging) is a strong plus.
A history of being the anchor expert that an organization relies on for its hardest GPU and compute problems, and the leadership to up-level the engineers around you.

For roles that are based at our headquarters in San Mateo, CA: The starting base pay for this position is as shown below. The actual base pay is dependent upon a variety of job-related factors such as professional background, training, work experience, location, business needs and market demand. Therefore, in some circumstances, the actual salary could fall outside of this expected range. This pay range is subject to change and may be modified in the future. All full-time employees are also eligible for equity compensation and for benefits as described on this page.

Annual Salary Range

$345,040 to $399,420 USD

Roles that are based in an office are onsite Tuesday, Wednesday, and Thursday, with optional presence on Monday and Friday (unless otherwise noted).

Roblox provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws. Roblox also provides reasonable accommodations to candidates with qualifying disabilities or religious beliefs during the recruiting process.

For US-based roles only, please note that the Company may not be able to employ candidates for this role who have United States work authorization related to certain U.S. visa categories, or to support future H-1B sponsorship at this time.
