Staff Software Engineer I - SRE

Confluent · Data AI · India · Remote · Engineering

Confluent is seeking a Staff Software Engineer I - SRE to focus on proactive reliability improvements and incident management for its multi-cloud streaming platform. The role is 75% engineering work (automation, tooling, system analysis) and 25% program ownership (training, incident response practices). The ideal candidate has 10+ years in SRE, incident management, or reliability engineering, and deep expertise in distributed systems, observability, and incident management tooling. Kafka or event streaming experience is preferred.

What you'd actually do

  1. Analyze systemic failure patterns and design improvements that prevent incident recurrence
  2. Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  3. Build tooling and automation to reduce incident response toil and scale team impact
  4. Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  5. Analyze reliability data to identify systemic improvements; build dashboards that drive action

Skills

Required

  • SRE
  • incident management
  • reliability engineering
  • cloud experience (AWS, GCP, or Azure)
  • incident management tooling (Rootly, PagerDuty, or similar)
  • distributed systems
  • observability (metrics, logging, tracing)
  • Kubernetes
  • container orchestration
  • CI/CD pipelines
  • release processes
  • systems thinking
  • SLO/SLA frameworks
  • written communication
  • post-mortem facilitation
  • async collaboration

Nice to have

  • Kafka/event streaming expertise
  • multi-cloud experience (2+ of AWS/GCP/Azure)
  • AI-assisted workflows

What the JD emphasized

  • 10+ years in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar platforms)
  • Strong understanding of distributed systems and failure modes at scale—Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
  • Deep experience with observability: metrics, logging, tracing—ability to diagnose complex issues
  • Large-company experience navigating reliability/incident programs at organizations with 500+ engineers