Architect, Site Reliability Engineering

Adobe · Enterprise · San Jose, CA +1

This role is for a Senior SRE Technical Lead/Architect responsible for the reliability, scalability, and operational excellence of Adobe's Real-Time Customer Data Platform (RTCDP). The role owns day-to-day production reliability, leads incident response, and provides technical leadership for the platform's core datastores (Aerospike, FoundationDB, Postgres, CosmosDB, DynamoDB). It also covers designing automation, improving monitoring and observability, and influencing architectural decisions for scale and operability. The role emphasizes SRE and DevOps principles and requires experience running large-scale distributed systems.

What you'd actually do

  1. Own Day-2 production reliability for RTCDP, ensuring availability, performance, and durability aligned with SLOs.
  2. Serve as technical lead during incidents, driving mitigation, recovery, and post-incident analysis for SEV3 through SEV1 (CSO) events.
  3. Provide technical leadership for RTCDP’s core datastores: Aerospike, FoundationDB, Postgres, CosmosDB, and DynamoDB.
  4. Drive reliability, scalability, upgrade, backup/restore, and disaster recovery strategies.
  5. Design and build automation-first solutions to reduce toil and improve system safety.

Skills

Required

  • 10+ years of experience running large-scale distributed systems
  • Strong background in datastores
  • Strong background in reliability engineering
  • Strong background in automation
  • Proven experience leading production incidents
  • Proven experience driving operational improvements
  • Comfortable operating with high ownership, ambiguity, and scale
  • Mentoring engineers
  • Influencing teams across geographies

Nice to have

  • SRE principles
  • DevOps principles
  • Error budgets
  • Continuous improvement

What the JD emphasized

  • high-ownership, high-impact role
  • production operations (Day-2 ownership)
  • core datastore engineering
  • Own Production Reliability
  • Day-2 production reliability
  • technical lead during incidents
  • production-ready launches
  • core datastores
  • datastore automation
  • cost optimization and efficiency improvements
  • automation-first solutions
  • reliability, scalability, and operability
  • monitoring, alerting, and observability
  • senior technical authority
  • high ownership, ambiguity, and scale
  • large-scale distributed systems
  • datastores, reliability engineering, and automation
  • production incidents
  • own critical systems