Hardware Engineer, GPU & PCIe

Weights & Biases · Data AI · Bellevue, WA +2 · Technology

CoreWeave is seeking a Hardware Engineer to troubleshoot, develop, and optimize server hardware infrastructure, with a focus on GPU and PCIe components. The role involves automating the hardware lifecycle, collaborating with vendors, and ensuring high-performance, reliable hardware for its AI cloud platform.

What you'd actually do

  1. Troubleshoot complex GPU and PCIe related failures
  2. Partner with external vendors on failure analysis
  3. Track component RMAs
  4. Develop and maintain hardware/firmware management services
  5. Automate all aspects of the server hardware lifecycle

Skills

Required

  • 2+ years of experience supporting and troubleshooting data center class GPUs (H100 or newer, including InfiniBand and NVLink)
  • Proficiency in Ansible/Python and experience programmatically interacting with server BMCs, using IPMI or Redfish (preferably Redfish)
  • Experience using, integrating, and automating data center class GPU diagnostics and troubleshooting tools, including observability platforms like Prometheus and Grafana
  • In-depth knowledge of server hardware, components, and management technologies, particularly GPUs and PCIe devices
  • Strong passion for automation, with a commitment to automating processes comprehensively

Nice to have

  • Proven ability to stay current with the latest industry technologies and trends
  • Previous experience collaborating with hardware vendors to identify novel issues, generate operational playbooks, create alerts, and drive issue resolution to completion
  • Excellent documentation skills and attention to detail
  • Strong analytical and problem-solving abilities

What the JD emphasized

  • data center class GPUs (H100 or newer, including InfiniBand and NVLink)
  • programmatically interacting with server BMCs, using IPMI or Redfish (preferably Redfish)
  • automating data center class GPU diagnostics and troubleshooting tools, including observability platforms like Prometheus and Grafana
  • passion for automation