At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our employees around the world work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We’re looking for people who are determined to make life better for people around the world.

Eli Lilly and Company seeks a Platform Site Reliability Engineer to join the Software Product Engineering (SPE) Customer Operations team. You will design, operate, and continuously improve highly available, scalable, and fault-tolerant systems across cloud environments. You will play a critical role in establishing reliability standards, driving operational excellence, and enabling engineering teams to build and deploy with confidence.

What You’ll Do:

Define and implement SLOs, SLIs, and reliability standards that establish a consistent foundation for platform health, driving resilience through capacity planning, failover design, and disaster recovery strategies
Lead response for P1/P2 incidents, owning rapid mitigation and recovery while conducting thorough root cause analysis and implementing corrective actions that prevent recurrence
Develop and maintain runbooks, playbooks, and operational standards that enable the broader engineering organization to respond effectively and consistently
Implement and optimize observability frameworks spanning monitoring, logging, tracing, and alerting, improving system visibility and reducing alert noise through actionable, signal-driven insights
Leverage platforms such as Splunk, Prometheus, CloudWatch, or equivalent tooling to ensure teams have the telemetry they need to detect, diagnose, and resolve issues proactively
Build and maintain CI/CD pipelines and deployment automation; drive adoption of Infrastructure as Code and GitOps practices across engineering teams
Support engineering teams in integrating SRE principles throughout the software lifecycle
Implement secure-by-design practices across infrastructure and platforms, support vulnerability remediation and secure configurations, and ensure alignment with enterprise security and compliance standards
Partner with engineering teams to improve reliability, performance, and deployment practices
Provide technical guidance and mentorship to engineers, and communicate system health and incident impact clearly to stakeholders at all levels

Your Basic Requirements:

Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related technical field
7+ years of hands-on experience with AWS
Extensive experience with Kubernetes and containerization technologies (Docker, EKS, etc.)
Experience operating production-grade distributed systems
Experience in incident management and on-call support models
Experience defining and managing SLOs, SLIs, and error budgets
Hands-on experience with observability tools such as Splunk and the LGTM stack
Experience building and maintaining CI/CD pipelines
Proficiency with Infrastructure as Code tools (Terraform, CloudFormation, etc.)
Experience with scripting in Python, Bash, or PowerShell
Experience with networking and cloud architecture fundamentals
Experience implementing security best practices in cloud environments
Experience troubleshooting complex system and performance issues

What You Should Bring:

Experience with tools such as ArgoCD, GitHub Actions, or GitOps workflows
Familiarity with large-scale enterprise platforms and environments
Experience in regulated industries such as healthcare or pharma
Exposure to global support models and follow-the-sun operations
Strong written communication skills, including crafting incident updates, postmortems, and status summaries for mixed audiences

This role is hybrid, in office 3 days a week, and does not require travel.

Lilly is dedicated to helping individuals with disabilities to actively engage in the workforce, ensuring equal opportunities when vying for positions. If you require accommodation to submit a resume for a position at Lilly, please complete the accommodation request form (https://careers.lilly.com/us/en/workplace-accommodation) for further assistance. Please note this is for individuals to request an accommodation as part of the application process and any other correspondence will not receive a response.

Lilly is proud to be an EEO Employer and does not discriminate on the basis of age, race, color, religion, gender identity, sex, gender expression, sexual orientation, genetic information, ancestry, national origin, protected veteran status, disability, or any other legally protected status.

Our employee resource groups (ERGs) offer strong support networks for their members and are open to all employees. Our current groups include: Africa, Middle East, Central Asia Network, Black Employees at Lilly, Chinese Culture Network, Japanese International Leadership Network (JILN), Lilly India Network, Organization of Latinx at Lilly (OLA), PRIDE (LGBTQ+ Allies), Veterans Leadership Network (VLN), Women’s Initiative for Leading at Lilly (WILL), enAble (for people with disabilities).

Actual compensation will depend on a candidate’s education, experience, skills, and geographic location. The anticipated wage for this position is $126,000 - $224,400.

Full-time equivalent employees also will be eligible for a company bonus (depending, in part, on company and individual performance). In addition, Lilly offers a comprehensive benefit program to eligible employees, including eligibility to participate in a company-sponsored 401(k); pension; vacation benefits; eligibility for medical, dental, vision and prescription drug benefits; flexible benefits (e.g., healthcare and/or dependent day care flexible spending accounts); life insurance and death benefits; certain time off and leave of absence benefits; and well-being benefits (e.g., employee assistance program, fitness benefits, and employee clubs and activities). Lilly reserves the right to amend, modify, or terminate its compensation and benefit programs in its sole discretion and Lilly’s compensation practices and guidelines will apply regarding the details of any promotion or transfer of Lilly employees.

#WeAreLilly