Software Engineer, Kubernetes

Weights & Biases · Data AI · Bellevue, WA +3 · Technology

Software Engineer role focused on building, operating, and scaling Kubernetes-based production infrastructure. Responsibilities include developing automation, implementing monitoring and observability solutions, driving incident response, and engineering for resiliency. The role emphasizes experience with Kubernetes administration, container orchestration, and infrastructure-focused programming.

What you'd actually do

  1. Build, operate, and scale Kubernetes-based production infrastructure that delivers CoreWeave’s products with high reliability and performance.
  2. Develop automation, tooling, and infrastructure as code in Go and other infrastructure-focused languages to enable zero-touch operations, rapid recovery, and seamless deployments.
  3. Design, implement, and maintain monitoring, alerting, and observability solutions—leveraging the Grafana ecosystem and related tools—to proactively identify and resolve production issues.
  4. Drive incident response efforts, participate in on-call rotations, and lead root cause analysis to prevent recurrence and improve incident handling processes.
  5. Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimize AI workload performance and resource utilization at scale.

Skills

Required

  • Kubernetes administration
  • container orchestration
  • microservices architectures
  • automation
  • Go
  • Bash
  • Linux systems
  • Prometheus
  • Grafana
  • Datadog
  • Splunk
  • Loki
  • VictoriaMetrics
  • troubleshooting complex production issues
  • analytical skills
  • communication skills

Nice to have

  • SRE
  • production engineering
  • large-scale infrastructure/platform roles
  • infrastructure as code
  • zero-touch operations
  • rapid recovery
  • seamless deployments
  • monitoring
  • alerting
  • observability solutions
  • incident management
  • resiliency
  • redundancy
  • fault tolerance
  • disaster recovery
  • distributed systems
  • security
  • performance improvements
  • custom Kubernetes operators
  • intelligent orchestration frameworks

What the JD emphasized

  • high reliability and performance
  • high-uptime, customer-facing systems
  • measurable improvements in reliability and performance
  • flawless execution