What you'd actually do

Analyze systemic failure patterns and design improvements that prevent incident recurrence

Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments

Build tooling and automation to reduce incident response toil and scale team impact

Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack

Analyze reliability data to identify systemic improvements; build dashboards that drive action

Skills

Required

SRE
incident management
reliability engineering
cloud experience (AWS, GCP, or Azure)
incident management tooling (Rootly, PagerDuty, or similar)
distributed systems
observability (metrics, logging, tracing)
Kubernetes
container orchestration
CI/CD pipelines
release processes
systems thinking
SLO/SLA frameworks
written communication
post-mortem facilitation
async collaboration

Nice to have

Kafka/event streaming expertise
multi-cloud experience (2+ of AWS/GCP/Azure)
AI-assisted workflows

What the JD emphasized

10+ years in SRE, incident management, or reliability engineering

Cloud experience with at least one of AWS, GCP, or Azure

Deep expertise with incident management tooling (Rootly, PagerDuty, or similar platforms)

Strong understanding of distributed systems and failure modes at scale—Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Deep experience with observability: metrics, logging, tracing—ability to diagnose complex issues

Large company experience navigating reliability/incident programs at 500+ engineer organizations

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.

About the Role:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale—data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.

You'll be part of a global team with follow-the-sun coverage, with clean handoffs that keep everyone working sustainable hours. Confluent has 800-1000 engineers across highly autonomous teams. This role sits within Cloud Architecture and Reliability - Supportability (CAR-S), a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

What You Will Do:

Proactive Reliability Engineering (~75% of role)
· Analyze systemic failure patterns and design improvements that prevent incident recurrence
· Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
· Build tooling and automation to reduce incident response toil and scale team impact
· Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
· Analyze reliability data to identify systemic improvements; build dashboards that drive action
· Explore AI-assisted approaches to documentation quality and incident analysis
· Design scalable reliability standards that reduce reactive workload over time

Incident Management Program (~25% of role)
· Own standards, practices, and continuous improvement of incident response
· Serve as an on-call Incident Commander for production incidents, including acting as escalation IC when incidents exceed a team's management chain
· Develop and deliver training programs for engineering teams at all levels
· Coach teams through post-mortems and in developing actionable corrective actions

Customer Root Cause Analysis (CRCA)
· Edit and review customer-facing incident documents to ensure quality and clarity
· Drive turnaround SLAs while maintaining technical accuracy
· Ensure clear explanation of what happened, why, and how we'll prevent recurrence

Cross-Team Leadership
· Partner with engineering leaders to elevate reliability practices
· Be the expert who teams proactively engage for guidance

What You Will Bring:

· 10+ years in SRE, incident management, or reliability engineering
· Cloud experience with at least one of AWS, GCP, or Azure
· Deep expertise with incident management tooling (Rootly, PagerDuty, or similar platforms)
· Strong understanding of distributed systems and failure modes at scale; Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
· Deep experience with observability (metrics, logging, tracing) and the ability to diagnose complex issues
· Kubernetes and container orchestration experience
· Understanding of CI/CD pipelines and release processes
· Systems thinking: understanding how infrastructure design choices affect failure modes and recovery
· Familiarity with SLO/SLA frameworks
· Track record as a trusted advisor across engineering organizations
· Experience driving org-wide process and cultural changes
· Strong written communication (design docs, one-pagers, runbooks)
· Post-mortem facilitation experience
· Experience with async collaboration across time zones
· Large company experience navigating reliability/incident programs at 500+ engineer organizations

What Gives You an Edge:

Multi-cloud experience (at least two of AWS/GCP/Azure)
Modern CI/CD, GitHub, AI-assisted workflows—you'll have the freedom to build what you need

Ready to build what's next? Let’s get in motion.

Come As You Are

Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.

Privacy Statement

Confluent is an IBM subsidiary that has been acquired by IBM and will be integrated into the IBM organization. By proceeding with this application, you understand that Confluent will share your personal information with other IBM affiliates involved in your recruitment process, wherever these are located. More information on how IBM protects your personal information, including the safeguards in case of cross-border data transfer, is available here.
