Senior Software Engineer, Server Fleet Infrastructure

Weights & Biases · Data AI · Bellevue, WA +3 · Technology

CoreWeave is seeking a Senior Software Engineer for its Server Fleet Infrastructure team. The role involves designing and building software to manage large-scale bare metal compute infrastructure across globally distributed data centers, with a focus on automation, fleet lifecycle management, and observability. The engineer will work with technologies such as Go, Python, Ansible, Linux, gRPC, and Kubernetes, and will develop provisioning services, custom controllers, and monitoring solutions that keep infrastructure for AI workloads reliable and efficient.

What you'd actually do

  1. Design and build software that manages complex infrastructure across globally distributed data centers.
  2. Automate bare metal, build fleet lifecycle management services, solve multi-layer integration challenges, and observe the globally distributed fleet; this work is critical to the company’s delivery of reliable and efficient infrastructure.
  3. Design and implement solutions to problems of scale for multi-site deployment and management of CoreWeave’s global server hardware fleet.
  4. Build and maintain backend services and APIs (gRPC/REST) in Go or Python to interact with Kubernetes and other infrastructure systems.
  5. Develop provisioning services, automation workflows, and fleet management tools that span from bare metal to container orchestration.

Skills

Required

  • 5+ years of experience in software or infrastructure engineering
  • Proficiency in Go and/or Python software development
  • Familiarity with CI/CD tools like Argo, Flux, and GitHub Actions
  • Strong understanding of Linux internals

Nice to have

  • Experience designing, implementing, and monitoring Kubernetes operators for custom resource definitions
  • Experience with infrastructure automation and configuration management tools such as Ansible, Puppet, Chef, or Salt
  • Experience with distributed cloud computing principles, including testing strategies, observability, error budgets, and fault-tolerant design
  • Experience implementing metrics pipelines, custom alerts, and monitoring strategies
  • Ability to break down complex problems into achievable tasks and collaborate with teammates to execute them
  • Willingness and ability to thrive in a fast-paced startup environment

What the JD emphasized

  • AI workloads
  • GPU compute