Production Engineer – Team Lead

Weights & Biases · Data AI · Singapore · Technology

CoreWeave is seeking a Production Engineer – Team Lead to oversee cloud infrastructure stability and reliability. This role involves leading incident management, defining and tracking SLOs, driving improvements in system resilience, and mentoring the production engineering team. The ideal candidate will have experience in production engineering, cloud platforms, and incident management frameworks.

What you'd actually do

  1. Act as the Incident Commander during incidents, providing decisive leadership to ensure timely and effective resolution while minimizing impact.
  2. Define and track Service Level Objectives (SLOs) and ensure alignment with business goals and team objectives.
  3. Lead the team's development by training and mentoring Production Engineers I/II in incident management best practices, tools, and systems.
  4. Identify and lead initiatives to improve system resilience, scalability, and disaster recovery capabilities across the platform.
  5. Drive rapid incident resolution by leading collective technical deep-dives into unfamiliar systems, connecting SMEs to uncover the unknown unknowns in their systems and how those systems fit into the broader ecosystem.

Skills

Required

  • production engineering
  • cloud operations
  • site reliability engineering (SRE)
  • incident response
  • cloud platforms (Kubernetes, AWS, GCP)
  • incident management frameworks (ITIL, SRE)
  • monitoring and alerting tools (Prometheus, Grafana)
  • observability principles
  • automation
  • scripting
  • configuration management and infrastructure-as-code tooling (Python, Bash, Terraform)
  • decision making under pressure
  • communication skills

What the JD emphasized

  • critical incidents
  • high-priority incidents
  • high-stakes incident resolution