Senior Software Engineer, Compute Architecture

Weights & Biases · Data AI · New York, NY · Technology

Senior Software Engineer role building the software control plane for hardware lifecycle management in large-scale GPU data centers. The role involves designing and operating Go-based distributed services for infrastructure bring-up, monitoring, automation, and observability, with an emphasis on reliability and hardware-aware automation.

What you'd actually do

  1. Design, build, and operate Go-based services that manage the lifecycle of large-scale GPU data center infrastructure.
  2. Build automation for data center bring-up, hardware discovery, health monitoring, remediation, and production operations.
  3. Develop reliable APIs, services, and workflows for managing BMCs, firmware state, server health, and rack-level infrastructure.
  4. Improve observability, alerting, and operational tooling so production issues can be detected, understood, and resolved quickly.
  5. Translate incidents and hardware failure modes into software improvements that make the platform more resilient.

Skills

Required

  • 5+ years of experience building and operating infrastructure or backend systems
  • Go
  • gRPC
  • REST APIs
  • Kubernetes
  • containerized workloads
  • Prometheus
  • Grafana

Nice to have

  • GPU-based systems
  • low-level hardware management (BMCs or Redfish)
  • large-scale distributed systems
  • high-throughput infrastructure
  • open-source projects (Go, Redfish)

What the JD emphasized

  • large-scale GPU data centers
  • hardware lifecycle management
  • production reliability
  • hardware-aware automation
  • GPU-based systems
  • low-level hardware management
  • large-scale distributed systems