Staff Software Engineer, Compute Architecture

Weights & Biases · Data AI · Bellevue, WA +3 · Technology

Staff Software Engineer focused on architecting and operating large-scale GPU-driven data center infrastructure for AI workloads. The role involves building and enhancing infrastructure services, automation, and orchestration for tens of thousands of GPU servers and related hardware, with a strong emphasis on reliability, security, and scalability. Responsibilities include technical leadership, API development, Kubernetes orchestration, CI/CD pipelines, and championing observability.

What you'd actually do

  1. Provide technical leadership in designing, architecting, and operating large-scale infrastructure services for GPU servers, with a focus on security, reliability, and scalability.
  2. Build and enhance infrastructure services and automation, including inventory management systems and lifecycle management solutions using open source technologies.
  3. Drive strategic direction for infrastructure automation, lifecycle management, and service orchestration, making MetalDev core services more scalable and resilient.
  4. Define best practices for API development (REST/gRPC), distributed databases, and Kubernetes orchestration—while mentoring engineers to follow your lead.
  5. Partner with hardware, software, and operations teams to align infrastructure with business impact.

Skills

Required

  • Go
  • REST/gRPC APIs
  • Kubernetes
  • distributed databases
  • observability stacks (Prometheus, Grafana, PromQL)
  • CI/CD pipelines
  • incident response
  • postmortems
  • service reliability

Nice to have

  • Kafka
  • ClickHouse
  • CockroachDB (CRDB)
  • DMTF
  • Redfish APIs
  • GPU servers

What the JD emphasized

  • large-scale infrastructure services for GPU servers
  • tens of thousands of GPU servers
  • NVIDIA GB200 and GB300
  • GPU server platforms
  • large-scale datacenter and cloud environments
  • large fleets of GPU servers