Operations Engineering Manager, Fleet Reliability

Weights & Biases · Data AI · Bellevue, WA · Technology

CoreWeave is seeking an Operations Manager for its Fleet Reliability Operations team. The role manages a 24/7 team responsible for provisioning, updating, and triaging server nodes, and for executing the processes and tooling used for server fleet configuration and validation. The manager will also develop talent pipelines, manage onboarding and training, and champion reliability and customer satisfaction. The role requires experience in software or infrastructure engineering, including time in a leadership capacity, and knowledge of SRE fundamentals, incident management, observability, and change management. CoreWeave is a cloud provider specializing in AI infrastructure.

What you'd actually do

  1. Build and lead a 24/7 team of process-oriented, reliability- and observability-focused engineers.
  2. Lead the socialization and documentation of clear and consistent processes for provisioning, validating and troubleshooting nodes in our server fleet.
  3. Think critically about and advocate for process and automation improvements, with event-driven automated remediation as the end goal.
  4. Provide a 24/7 engineering support function for high-criticality, time-sensitive node delivery and maintenance.
  5. Drive and improve our program of onboarding, documentation, enablement, and performance management to help your team members achieve new heights of personal growth and capability.

Skills

Required

  • Leadership experience
  • SRE fundamentals
  • Incident management
  • Observability
  • Change management
  • Automation
  • Process improvement
  • Talent development
  • Onboarding and training

Nice to have

  • Experience with hardware issues in production
  • Event-driven automated remediation

What the JD emphasized

  • seven or more years of experience in a software or infrastructure engineering industry, of which at least two years were in a leadership capacity
  • background that includes the knowledge and practice of SRE fundamentals, incident management, blameless culture, observability, and change management
  • champion reliability and customer satisfaction
  • process and automation improvements