Senior Incident Manager

Braze · Enterprise · Austin, TX · Engineering

This role is for a Senior Incident Manager at Braze, focusing on major incident management, process management, program management, and release management within the Technology Operations Team. The goal is to ensure Braze operates as a technology-first business by standardizing, enforcing, and improving processes across engineering departments, with a focus on scalability, observability, and reusability. The role involves creating and executing incident response strategies, leading incident resolution, improving related tools and processes, managing incident-related training, overseeing the incident management process, prioritizing incidents, contributing to blameless post-mortems, translating technical issues into business impacts, and leading the weekly release process.

What you'd actually do

  1. Create, communicate, and execute the incident response strategy and actions for individual incidents (spanning Security, IT, DevOps, and Product Engineering)
  2. Serve as Incident Commander, driving resolution of incidents by partnering closely with Engineering, Technical Support, and Customer Success
  3. Lead and contribute to projects that improve tools and processes related to manageability, observability, and resiliency
  4. Manage incident-related training, including cross-training of our SREs, DevOps, and Application Engineers
  5. Oversee the incident management process and the team members involved in resolving each incident

Skills

Required

  • 7+ years of experience in incident management, operations, or technical support
  • Able to effectively communicate critical issue status (both verbally and in writing) to executive staff, go-to-market teams, and other involved parties
  • Able to effectively build and maintain relationships with key stakeholders across the business
  • Ability to lead, make decisions, solve problems, and work within teams, with the flexibility and agility to move between role types within teams
  • Ability to effectively prioritize and execute tasks in a high-pressure environment
  • Experience leading technical incidents and driving them to resolution, whether as part of an on-call team or as an incident manager
  • A strong technical background and experience with specific tools for reporting, documentation, and observability (Jira, Confluence, Datadog, or the equivalent)
  • A good foundational understanding of release management concepts, DevOps, and SRE
  • A high degree of operational excellence, a habit of data-driven decision-making to minimize risk, and a love of building and managing against reports and data

What the JD emphasized

  • Scalability
  • Observability
  • Reusability
  • Incident management
  • Release management
  • Technical incidents