Sr GPU Infrastructure Software Engineer

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

This role focuses on engineering and maintaining infrastructure for GPU performance testing and validation. The engineer will design and implement solutions for large-scale infrastructure, develop Kubernetes-native controllers and operators, build backend services, and enhance visibility into system metrics. While the company is an AI cloud provider and the role touches AI/ML infrastructure, the core responsibility is infrastructure engineering and testing, not direct AI model development.

What you'd actually do

  1. Design and implement solutions to problems of scale for testing and validation of CoreWeave’s global infrastructure.
  2. Design and develop Kubernetes-native controllers and operators to automate infrastructure workflows.
  3. Build and maintain scalable backend services and APIs (gRPC/REST) in Go or Python.
  4. Develop performance tests and automation workflows to expand hardware validation across the CoreWeave fleet.
  5. Write and maintain Kubernetes custom controllers and operators to automate infrastructure testing.

Skills

Required

  • Go
  • Python
  • Kubernetes operators/controllers

Nice to have

  • Experience testing hardware at scale
  • HPC experience
  • Experience with AI/ML infrastructure, including training and inference

What the JD emphasized

  • GPU performance testing platform
  • writing Kubernetes operators/controllers