Manager, Bare Metal Support Engineering

Weights & Biases · Data AI · Singapore · Global Field Organization

This is a Manager of Bare Metal Support Engineering role at CoreWeave, leading a team that maintains and optimizes physical infrastructure (servers, GPUs, power, cooling) for AI workloads. Responsibilities include daily support operations, incident triage, escalation management, process improvement, and client communication, all aimed at keeping the cloud platform stable and performant for AI clients.

What you'd actually do

  1. Lead a skilled team responsible for maintaining and optimizing physical infrastructure across multiple client environments.
  2. Build, develop, and lead a dedicated Infrastructure Support team that supports key infrastructure, handles escalations, and keeps hardware operations running smoothly.
  3. Oversee the resolution of infrastructure-related incidents, manage escalations, and collaborate with internal teams to deliver effective solutions.
  4. Improve support processes to enhance efficiency and reduce downtime, ensuring the infrastructure meets client expectations.
  5. Work closely with product, infrastructure, and other teams to ensure seamless delivery of infrastructure resources.

Skills

Required

  • leading teams responsible for infrastructure support, data center operations, or physical compute environments
  • Linux system administration
  • command-line tools
  • hardware-level diagnostics, troubleshooting, and replacement
  • high-performance rack-scale hardware
  • GPU infrastructure
  • incident and escalation management
  • ticket-based workflows
  • interpreting and acting on metrics (MTTR, SLOs, backlog, ticket trends)
  • managing scheduling, shift coverage, and team logistics

Nice to have

  • managing infrastructure support teams in high-growth or rapidly evolving environments
  • developing and implementing operational processes that scale with business needs
  • server and GPU hardware lifecycle management
  • coaching and growing technical teams
  • developing and interpreting metrics
  • AI/ML workloads, cluster utilization patterns, or the infrastructure needs of GPU-heavy clients

What the JD emphasized

  • hardware-level diagnostics, troubleshooting, and replacement
  • high-performance rack-scale hardware
  • GPU infrastructure
  • incident and escalation management
  • client communication during escalations