Operations Engineer, Fleet Reliability

Weights & Biases · Data AI · Bellevue, WA +4 · Technology

CoreWeave is seeking an Operations Engineer for Fleet Reliability to manage and maintain its GPU supercomputing clusters. Responsibilities include provisioning, troubleshooting hardware and software issues, monitoring system performance, and creating documentation. The role requires strong Linux system administration and scripting skills; experience with data center infrastructure, observability platforms, and HPC is preferred.

What you'd actually do

  1. Configure and maintain large-scale high-performance supercomputing clusters running state-of-the-art GPUs
  2. Troubleshoot hardware and software issues; escalate and coordinate as needed with data center, network, hardware and platform teams to drive resolution
  3. Monitor and analyze system performance and take appropriate remediation actions for cloud health
  4. Approach your work with flexibility and optimism, anticipating shifting business and technical priorities
  5. Create and maintain documentation of team processes, knowledge and best practices for system management

Skills

Required

  • Linux system administration
  • Troubleshooting hardware and software issues
  • System maintenance tasks
  • Software development or scripting languages (Bash, Python, PowerShell, etc.)

Nice to have

  • 2+ years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network, or a mix)
  • Grafana, Prometheus, PromQL queries, or similar observability platforms
  • Data center environments including server racks, HVAC systems, fiber trays
  • Kubernetes administration
  • HPC: administering GPU-related workloads

What the JD emphasized

  • state-of-the-art GPUs