Manager, Engineering (Production Orchestration)

Cockroach Labs · Data AI · New York, NY · Engineering

Manager, Engineering for the Production Orchestration team at Cockroach Labs. Focuses on the reliability, availability, and scalability of CockroachDB in production. Drives automation, tooling, and AI-driven approaches for development and operations. Manages a global team, coaches engineers, and contributes to foundational architecture. Requires experience in SRE, distributed systems, and managing global operations. Comfort with languages like Go and Python is expected.

What you'd actually do

  1. Lead the Production Orchestration team, focused on the reliability, availability, and scalability of CockroachDB in production.
  2. Own operational excellence. Ensure the team is meeting or exceeding our SLAs, running effective incident response, and continuously improving our operational posture. Every incident is treated as a learning opportunity.
  3. Partner across the global Production Engineering organization to align on shared goals, ensure smooth coordination across time zones, and drive cohesive execution.
  4. Drive automation and tooling. Relentlessly reduce operational toil by building systems that improve observability and scale our fleet without scaling headcount linearly.
  5. Leverage AI to improve how the team builds and operates. Help the team adopt AI-assisted development practices and identify applied AI opportunities to improve operational workflows, from alert triage to capacity planning to incident response.

Skills

Required

  • Experience leading global operations and/or incident management and response
  • Experience working on complex technical products with exposure to distributed systems, cloud infrastructure, container orchestration, or large-scale fleet management
  • A strong SRE or Production Engineering background
  • Comfort with programming languages like Go and Python
  • Solid systems architecture knowledge
  • Experience with performance management

Nice to have

  • Go (if not known, will learn)

What the JD emphasized

  • AI-driven approaches to both development and operations
  • applied AI opportunities to improve operational workflows
  • foundational architectural changes to how we operate our fleet
  • new architectural initiative that will reshape how we operate our fleet
  • experience leading global operations and/or incident management and response
  • large-scale fleet management
  • SRE or Production Engineering background
  • performance management