Staff Production Engineer

Weights & Biases · Data AI · Bellevue, WA +3 · Technology

This role focuses on building and operating foundational platforms and frameworks for a cloud infrastructure provider, emphasizing reliability, observability, and scalability. The engineer will design, build, and own systems that reduce operational toil, improve delivery velocity, and enhance availability and resiliency. Key responsibilities include developing automation, self-service capabilities, and paved paths for operational excellence, as well as participating in incident response and shipping production code. The role requires deep expertise in distributed systems, cloud-native platforms (especially Kubernetes), and observability practices.

What you'd actually do

  1. Design, build, and own the foundational platforms and frameworks that underpin operational excellence across CoreWeave
  2. Develop a deep understanding of CoreWeave’s infrastructure and services, shape architecture and tooling decisions, and partner closely with service owners to operationalize reliability through automation and paved paths rather than manual process or advocacy
  3. Build and evolve systems for observability, alerting, automated remediation, resiliency testing, and authoritative sources of truth, operationalizing best practices through tooling rather than manual enforcement
  4. Participate in incident response for critical outages with the explicit goal of improving systems, tooling, and defaults to reduce future operational load, not as a long-term escalation path
  5. Ship production code, participate in on-call rotations as needed, and mentor engineers on platform ownership, operational design, and sustainable production practices

Skills

Required

  • building and operating distributed systems or cloud platforms at scale
  • diagnosing and resolving complex production failures
  • programming experience (Python, Go, or similar)
  • shipping and operating production systems
  • cloud-native platforms and distributed systems
  • Kubernetes
  • observability and incident practices
  • metrics
  • tracing
  • structured logs
  • SLIs/SLOs
  • PIRs
  • leading large technical efforts
  • influencing outcomes across teams
  • delivering durable, platform-driven improvements
  • reducing operational risk
  • scaling with organizational growth

Nice to have

  • ownership of foundational internal platforms or frameworks used broadly across an organization
  • service tiering
  • disaster recovery or business continuity planning
  • chaos engineering
  • structured resilience programs
  • operating large-scale AI/cloud infrastructure
  • guiding organizations through rapid scale while maintaining operational quality and discipline

What the JD emphasized

  • 10+ years of experience building and operating distributed systems or cloud platforms at scale
  • Demonstrated ability to diagnose and resolve complex production failures across services, infrastructure, and automation layers
  • Strong programming experience (Python, Go, or similar) with a history of shipping and operating production systems
  • Deep expertise in cloud-native platforms and distributed systems, especially Kubernetes
  • Advanced experience with observability and incident practices, including metrics, tracing, structured logs, SLIs/SLOs, and PIRs
  • Proven ability to lead large technical efforts and influence outcomes across teams without direct authority
  • Track record of delivering durable, platform-driven improvements that reduce operational risk and scale with organizational growth
  • Ownership of foundational internal platforms or frameworks used broadly across an organization
  • Background operating large-scale AI/cloud infrastructure