GPU Infrastructure Software Engineer

Weights & Biases · Data AI · Livingston, NJ +1 · Technology

A software engineering role focused on designing, implementing, and maintaining cloud infrastructure, platforms, and internal tools that improve the efficiency, reliability, and scalability of systems for GPU performance testing and validation. The role involves working with Kubernetes and Go/Python, and improving visibility into system metrics.

What you'd actually do

  1. Implement and maintain solutions to problems of scale for testing and validation of CoreWeave’s global infrastructure.
  2. Develop and improve performance tests and automation workflows to expand hardware validation across the CoreWeave fleet.
  3. Troubleshoot and maintain Kubernetes custom controllers and operators that automate infrastructure testing.
  4. Adapt and extend open source tooling to enhance visibility into system metrics, performance, and health.
  5. Participate in an on-call rotation.

Skills

Required

  • Go
  • Python
  • Kubernetes

Nice to have

  • backend services
  • testing hardware at scale
  • HPC experience
  • AI/ML infrastructure and training/inference

What the JD emphasized

  • GPU performance testing platform
  • Kubernetes at production scale
  • Go and/or Python software development