Bare Metal Support Engineer

Weights & Biases · Data AI · Bellevue, WA +4 · Remote · Global Field Organization

This role supports the infrastructure behind AI workloads, focusing on bare-metal GPU fleet management, customer issue resolution, and operational reliability in a data center environment. It involves troubleshooting hardware, software, and networking issues so that customers' AI workloads run without disruption.

What you'd actually do

  1. Provide high-level support for customers utilizing bare-metal GPU fleets on CoreWeave Cloud.
  2. Diagnose, triage, and investigate reported customer issues and high-priority incidents, identifying root causes and escalating when necessary.
  3. Create and maintain internal documentation, including troubleshooting guides, best practices, and knowledge base articles.
  4. Perform in-depth log analysis and debugging across multiple layers of the stack (firmware, drivers, hardware).
  5. Collaborate with engineering teams to improve hardware reliability, software stability, and system performance.

Skills

Required

  • Experience in data centers, GPU clusters, server deployments, system administration, or hardware troubleshooting.
  • Demonstrated experience driving resolutions and continuous improvements across cross-functional teams in a data center environment.
  • Intermediate knowledge of Linux (Ubuntu, CentOS, or similar), including command-line proficiency.
  • Experience with NVIDIA GPUs, Supermicro systems, Dell systems, high-performance computing (HPC), and large-scale data center environments.
  • Experience in networking fundamentals (TCP/IP, VLANs, DNS, DHCP) and troubleshooting tools.
  • Hands-on experience with firmware updates, BIOS configurations, and driver management.
  • Experience analyzing system logs and debugging issues across firmware, drivers, and hardware layers.
  • Experience working with Jira, Confluence, Notion, or other issue-tracking and documentation platforms.
  • Experience in scripting and automation (Python, Bash, Ansible, or similar).

Nice to have

  • Kubernetes
  • Docker
  • Containerized infrastructure
  • Strong problem-solving skills
  • Proactive and analytical mindset
  • Excellent communication skills
  • Demonstrated ability to work collaboratively in a fast-paced environment

What the JD emphasized

  • 24/7/365 team
  • on-call rotation