Senior Software Engineer - Robinhood Command Center

Robinhood · Fintech · Menlo Park, CA · ENG Technical Assurance

Senior Software Engineer for Robinhood's Command Center (RCC), a new reliability team focused on detecting, coordinating, and mitigating production incidents. This role involves leading incident response, defining reliability and observability strategy, developing incident management processes and tooling, and driving post-incident learning. The engineer will not own product services but will own the processes and tools for incident response.

What you'd actually do

  1. Serve as a senior technical leader driving the long-term reliability and observability strategy across Robinhood’s infrastructure
  2. Partner closely with engineers across many disciplines to raise the bar for operational excellence and incident response
  3. Lead incident mitigation efforts by coordinating service owners, facilitating time-sensitive decisions such as rollbacks and traffic shifts, and maintaining a clear source of truth during active incidents
  4. Develop and maintain incident management processes and procedures to ensure timely resolution and minimize customer impact
  5. Own incident discovery at the company level by defining and maintaining global dashboards and alerts tied to critical user journeys (CUJs), availability, and business-impact metrics

Skills

Required

  • 5+ years of software engineering experience, including significant experience operating production systems
  • 2+ years focused on reliability engineering, infrastructure, distributed systems, or production operations
  • Hands-on experience serving in incident leadership roles (e.g., IMOC, incident commander, primary oncall)
  • Strong communication and cross-functional collaboration skills, especially during high-severity incidents
  • Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture design
  • Experience with multi-region or multi-cluster architectures, capacity planning, and failover strategies
  • Familiarity with modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana)
  • Demonstrated ability to drive measurable improvements in MTTD, MTTR, availability, or customer impact

What the JD emphasized

  • reliability engineering
  • incident leadership
  • operational excellence
  • incident response
  • observability frameworks
  • fault-tolerant architecture design
  • modern observability stacks
  • MTTD
  • MTTR