Site Reliability Engineer

Adobe · Enterprise · Noida, India

The Site Reliability Engineer for Illustrator Enterprise Services defines and leads the reliability strategy for high-traffic, globally distributed systems. The role sets technical direction for reliability engineering, enables large-scale automated creative workflows, and architects resilient, observable, self-healing systems. Key responsibilities include building automation frameworks, introducing AI/ML-based predictive monitoring, leading reliability initiatives such as chaos engineering and SLO adoption, and ensuring comprehensive observability. The role also covers incident response, performance tuning, capacity engineering, and cross-team leadership and mentorship.

What you'd actually do

  1. Define and drive the long-term reliability and scalability strategy for Illustrator Enterprise Services, aligning it with product and business goals.
  2. Build and champion advanced automation frameworks that enable zero-touch operations across deployment, recovery, and scaling workflows.
  3. Introduce AI/ML-based predictive monitoring and anomaly detection systems to anticipate failures before they impact users.
  4. Serve as a technical authority during high-impact incidents, guiding cross-functional teams through real-time mitigation and long-term prevention.
  5. Mentor and coach SREs and software engineers, cultivating deep reliability-first thinking across teams.
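
For a concrete sense of the anomaly detection mentioned in item 3, here is a minimal sketch, not a description of Adobe's tooling. It flags a metric sample when it deviates from a trailing window's mean by more than a chosen number of standard deviations (a rolling z-score). The function name `detect_anomalies`, the window size, the threshold, and the sample latencies are all assumptions made for illustration; a production system would likely use more robust models.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the trailing window's
    mean by more than `threshold` standard deviations.

    Hypothetical helper for illustration only.
    """
    history = deque(maxlen=window)  # trailing window of recent samples
    anomalies = []
    for i, value in enumerate(samples):
        # Only score a sample once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip perfectly flat windows to avoid dividing the signal by zero spread.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Made-up request latencies in milliseconds with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100]
spikes = detect_anomalies(latencies_ms)
```

In this sketch the spike itself enters the trailing window after it is scored, which temporarily widens the spread and makes the next few samples harder to flag; excluding flagged samples from the history is one possible refinement.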

Skills

Required

  • site reliability
  • production engineering
  • large-scale distributed system operations
  • cloud-native environments (AWS, Azure, GCP)
  • Python
  • Go
  • Java
  • Bash
  • Kubernetes
  • microservices
  • service mesh architectures
  • Infrastructure as Code (Terraform, CloudFormation)
  • CI/CD automation frameworks
  • observability and monitoring stacks (Prometheus, Grafana, Datadog, OpenTelemetry)
  • networking
  • storage
  • distributed databases (SQL and NoSQL)
  • architectural decisions
  • reliability strategy

Nice to have

  • reliability frameworks
  • SRE platforms
  • error budgets
  • chaos engineering
  • reliability reviews
  • high-traffic or latency-sensitive systems
  • big data ecosystems (Kafka, Spark, Hadoop)
  • large-scale data ingestion pipelines
  • security
  • compliance
  • governance in production environments (SOC 2, GDPR, ISO 27001)
  • Cloud or Kubernetes certifications
  • Published contributions or conference talks on reliability, automation, or distributed systems
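
For readers less familiar with the error budgets listed above, the arithmetic can be sketched as follows. An SLO target such as 99.9% implies that 0.1% of requests may fail within the measurement period; that allowance is the error budget. The function name `error_budget`, its parameters, and the numbers used are hypothetical and chosen only to illustrate the calculation.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute a request-based error budget for one measurement period.

    Hypothetical helper for illustration; `slo_target` is a fraction
    such as 0.999 for a 99.9% availability objective.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # Failures the SLO tolerates over the period.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        # A negative value means the budget has been overspent.
        "remaining_failures": allowed_failures - failed_requests,
        # Fraction of the budget already spent.
        "consumed": failed_requests / allowed_failures if allowed_failures else 0.0,
    }


# Assumed example: 99.9% SLO, one million requests, 250 failures.
budget = error_budget(0.999, 1_000_000, 250)
```

With these assumed numbers the budget allows roughly 1,000 failed requests, of which about a quarter has been spent. Teams commonly use the remaining budget to decide whether to keep shipping changes or to prioritize reliability work.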

What the JD emphasized

  • AI/ML-based predictive monitoring
  • anomaly detection systems
  • zero single points of failure
  • error budgets
  • SLO adoption
  • chaos engineering
  • observability architecture
  • high-impact incidents
  • reliability reviews
  • operational readiness assessments
  • performance tuning
  • capacity engineering
  • architectural bottlenecks
  • platform evolution
  • reliability-first thinking
  • automation-first culture
  • technical standards
  • design reviews
  • highly available
  • globally distributed systems
  • cloud-native environments
  • Kubernetes
  • microservices
  • service mesh architectures
  • Infrastructure as Code
  • CI/CD automation frameworks
  • observability and monitoring stacks
  • networking
  • storage
  • distributed databases
  • architectural decisions
  • reliability strategy
  • reliability frameworks
  • SRE platforms
  • high-traffic
  • latency-sensitive systems
  • security
  • compliance
  • governance