Manager, Incident Ops and Observability

F5 · Enterprise · Seattle, WA

This role manages the Incident Response (IR) program, focusing on optimizing processes from detection to post-incident analysis. It involves building and managing monitoring tools and services, leading incident response efforts, defining policies, driving improvements through automation, and reporting on KPIs for service reliability and observability. The role also establishes and leads the Problem Management, Change Management, and Configuration Management functions, and works with an ITSM platform such as ServiceNow for incident management.

What you'd actually do

  1. Lead the global Incident Response (IR) program, optimizing processes across detection, triage, containment, remediation, and post-incident analysis.
  2. Hire, mentor, and train global team members on incident response best practices and observability tooling.
  3. Serve as technical lead and head engineer for creating and managing the monitoring tools and services that support F5 infrastructure and business systems.
  4. Serve as the primary incident commander during major incidents, ensuring timely resolution, excellent communication, and stakeholder alignment.
  5. Define and continuously refine incident response policies, procedures, and runbooks to ensure consistent and effective handling of incidents.

Skills

Required

  • Incident response
  • NOC/SOC/SRE
  • Monitoring
  • Observability
  • Cloud and hybrid environments
  • Problem Management
  • Change Management
  • Configuration Management
  • ITSM platform (e.g., ServiceNow)
  • Observability tools (e.g., Grafana, ThousandEyes, LogicMonitor, Pingdom, Zabbix)
  • AWS
  • Google Workspace
  • SaaS platforms
  • Leadership
  • Communication skills

Nice to have

  • Experience in infrastructure, IT, or security organizations
  • Tableau, Power BI, or other reporting/analytics platforms
  • SIEM, SOAR, and log analysis tools (e.g., Splunk, Datadog, Panther, CrowdStrike)
  • ITIL 4 and/or Six Sigma certifications

What the JD emphasized

  • 10+ years managing incident response within NOC/SOC/SRE teams with a focus on monitoring and observability.
  • Proven track record of managing complex operational incidents in cloud and hybrid environments.
  • Experience driving continuous improvement and operational excellence in processes such as Problem Management, Change Management, and Configuration Management.
  • Experience working with and/or managing CMDB governance using an ITSM platform (e.g., ServiceNow).
  • Experience integrating runbooks, operational processes, and metrics reporting into an ITSM platform (e.g., ServiceNow).
  • Experience with observability tools, especially tooling focused on synthetics, metrics, and infrastructure telemetry (e.g., Grafana, ThousandEyes, LogicMonitor, Pingdom, Zabbix).