What you'd actually do

Define and drive the long-term reliability and scalability strategy for Illustrator Enterprise Services, aligning with product and business goals.

Build and champion advanced automation frameworks that enable zero-touch operations across deployment, recovery, and scaling workflows.

Introduce AI/ML-based predictive monitoring and anomaly detection systems to anticipate failures before they impact users.

Serve as a technical authority during high-impact incidents, guiding cross-functional teams through real-time mitigation and long-term prevention.

Mentor and coach SREs and software engineers, cultivating deep reliability-first thinking across teams.

Skills

Required

site reliability
production engineering
large-scale distributed system operations
cloud-native environments (AWS, Azure, GCP)
Python
Go
Java
Bash
Kubernetes
microservices
service mesh architectures
Infrastructure as Code (Terraform, CloudFormation)
CI/CD automation frameworks
observability and monitoring stacks (Prometheus, Grafana, Datadog, OpenTelemetry)
networking
storage
distributed databases (SQL and NoSQL)
architectural decisions
reliability strategy

Nice to have

reliability frameworks
SRE platforms
error budgets
chaos engineering
reliability reviews
high-traffic or latency-sensitive systems
big data ecosystems (Kafka, Spark, Hadoop)
large-scale data ingestion pipelines
security
compliance
governance in production environments (SOC2, GDPR, ISO27001)
Cloud or Kubernetes certifications
Published contributions or conference talks on reliability, automation, or distributed systems

What the JD emphasized

AI/ML-based predictive monitoring

anomaly detection systems

zero single points of failure

error budgets

SLO adoption

chaos engineering

observability architecture

high-impact incidents

reliability reviews

operational readiness assessments

performance tuning

capacity engineering

architectural bottlenecks

platform evolution

reliability-first thinking

automation-first culture

technical standards

design reviews

highly available

globally distributed systems

cloud-native environments

Kubernetes

microservices

service mesh architectures

Infrastructure as Code

CI/CD automation frameworks

observability and monitoring stacks

networking

storage

distributed databases

architectural decisions

reliability strategy

reliability frameworks

SRE platforms

high-traffic

latency-sensitive systems

security

compliance

governance

Looking for a site reliability engineer to define and lead the reliability strategy for **Illustrator Enterprise Services**, a set of high-traffic, globally distributed Illustrator workflows. In this role, you will set the technical direction for reliability engineering and enable services for large-scale automated creative workflows across packaging, personalisation, and asset generation.

System Architecture & Technical Strategy

Define and drive the long-term reliability and scalability strategy for Illustrator Enterprise Services, aligning with product and business goals.
Architect large-scale, distributed, and multi-region systems designed for resiliency, observability, and self-healing.
Anticipate systemic risks and design proactive mitigation strategies, ensuring zero single points of failure across critical services.
Partner with software architecture and infrastructure teams to evolve the platform toward greater reliability, efficiency, and cost optimization.

Automation, Observability & Reliability Engineering

Build and champion advanced automation frameworks that enable zero-touch operations across deployment, recovery, and scaling workflows.
Introduce AI/ML-based predictive monitoring and anomaly detection systems to anticipate failures before they impact users.
Lead organization-wide reliability initiatives such as chaos engineering, error budgets, and SLO adoption, driving measurable reliability improvements.
Continuously refine observability architecture (metrics, traces, logs) to ensure comprehensive, actionable insights into production health.

Incident Response & Operational Excellence

Serve as a technical authority during high-impact incidents, guiding cross-functional teams through real-time mitigation and long-term prevention.
Lead blameless postmortems and translate findings into actionable reliability roadmaps.
Drive reliability reviews and operational readiness assessments for all major product launches.

Performance, Scalability & Cost Efficiency

Lead large-scale performance tuning and capacity engineering efforts, ensuring optimal resource utilization and cost efficiency across environments.
Identify architectural bottlenecks, drive performance benchmarking, and influence platform evolution for better scalability and elasticity.

Cross-Team Leadership & Mentorship

Mentor and coach SREs and software engineers, cultivating deep reliability-first thinking across teams.
Serve as a thought leader in reliability engineering: driving best practices, evangelizing automation-first culture, and influencing technical standards across multiple teams.
Collaborate with engineering leaders, PMs, and operations to align priorities, set strategic goals, and deliver on high-impact reliability initiatives.
Lead technical deep dives and design reviews, ensuring all systems are built to scale securely and reliably.

Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
4+ years of experience in site reliability, production engineering, or large-scale distributed system operations.
Proven track record of designing and managing highly available, globally distributed systems in **cloud-native environments (AWS, Azure, GCP).**
**Expert-level proficiency in one or more programming/scripting languages (Python, Go, Java, Bash)** for automation and tooling.
Deep understanding of Kubernetes, microservices, and service mesh architectures.
Advanced experience with Infrastructure as Code (Terraform, CloudFormation) and CI/CD automation frameworks.
Mastery of observability and monitoring stacks (Prometheus, Grafana, Datadog, OpenTelemetry).
Strong expertise in networking, storage, and distributed databases (both SQL and NoSQL).
Demonstrated ability to influence architectural decisions and drive reliability strategy across organizations.
Exceptional communication, leadership, and stakeholder management skills.

Preferred Qualifications

Experience designing reliability frameworks or SRE platforms at scale (error budgets, chaos engineering, reliability reviews).
Prior experience in high-traffic or latency-sensitive systems (media streaming, advertising, or real-time platforms).
Familiarity with big data ecosystems (Kafka, Spark, Hadoop) and large-scale data ingestion pipelines.
Hands-on experience with security, compliance, and governance in production environments (SOC2, GDPR, ISO27001).
Cloud or Kubernetes certifications (AWS Solutions Architect Professional, CKA/CKAD, GCP Professional Cloud Architect).
Published contributions or conference talks on reliability, automation, or distributed systems.

About Adobe

Adobe empowers everyone to create through innovative platforms and tools that unleash creativity, productivity and personalized customer experiences. Adobe’s industry-leading offerings including Adobe Acrobat Studio, Adobe Express, Adobe Firefly, Creative Cloud, Adobe Experience Platform, Adobe Experience Manager, and GenStudio enable people and businesses to turn ideas into impact, powered by AI and driven by human ingenuity.

Our 30,000+ employees worldwide are creating the future and raising the bar as we drive the next decade of growth. We’re on a mission to hire the very best and believe in creating a company culture where all employees are empowered to make an impact. At Adobe, we believe that great ideas can come from anywhere in the organization. The next big idea could be yours.

**Let’s Adobe together**

At Adobe, we believe in creating a company culture where all employees are empowered to make an impact. Learn more about Adobe life, including our values and culture, focus on people, purpose and community, Adobe for All, comprehensive benefits programs, the stories we tell, the customers we serve, and how you can help us advance our mission of empowering everyone to create.

Adobe is proud to be an Equal Employment Opportunity employer. We do not discriminate based on gender, race or color, ethnicity or national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, or any other protected characteristic.

Adobe aims to make our Careers website and recruiting process accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, email accommodations@adobe.com or call +1 408-536-3015.

AI Use Guidelines for Interviews: Our interviews are designed to reflect your own skills and thinking. The use of AI or recording tools during live interviews is not permitted unless explicitly invited by the interviewer or approved in advance as part of a reasonable accommodation. If these tools are used inappropriately or in a way that misrepresents your work, your application may not move forward in the process.

At Adobe, we empower employees to innovate with AI — and we look for candidates eager to do the same. As part of the hiring experience, we provide clear guidance on where AI is encouraged during the process and where it’s restricted during live interviews. See how we think about AI in the hiring experience.
