Production Engineer

Weights & Biases · Data AI · Bellevue, WA +4 · Technology

Production Engineer role focused on maintaining the reliability and stability of CoreWeave's cloud infrastructure through incident response, operational support, and process improvements. Requires experience in cloud operations, SRE, or related technical roles, plus familiarity with monitoring tools and scripting.

What you'd actually do

  1. Assist in incident response efforts by helping identify and resolve service disruptions quickly, working under the guidance of more senior engineers.
  2. Help document incidents, assist with root cause analysis (RCA), and support post-incident reviews (PIRs) to identify lessons learned.
  3. Monitor system performance and health using tools like Prometheus and Grafana, identifying any performance issues or potential incidents.
  4. Help implement automation and process improvements to enhance efficiency and reduce manual intervention in incident detection and recovery.
  5. Collaborate with engineers across teams to improve platform reliability, resilience, and disaster recovery.

Skills

Required

  • Cloud operations
  • Site Reliability Engineering (SRE)
  • Cloud platforms (Kubernetes, AWS, GCP)
  • Incident management
  • Monitoring and alerting tools (Prometheus, Grafana)
  • Scripting and automation tools (Python, Bash, Terraform, Ansible)
  • Communication skills

Nice to have

  • Kubernetes
  • Containerization
  • Distributed systems
  • Change management processes
  • Post-incident analysis
  • Automated systems
  • Self-healing infrastructure

What the JD emphasized

  • 4 years of experience in cloud operations, site reliability engineering (SRE), or related technical roles.
  • Understanding of cloud platforms (e.g., Kubernetes, AWS, GCP) and basic knowledge of cloud infrastructure.
  • Familiarity with incident management practices and frameworks (e.g., ITIL, SRE best practices).
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana) or willingness to learn.
  • Basic experience with scripting or automation tools (e.g., Python, Bash, Terraform, Ansible).