Principal Platform Reliability Engineer

Eli Lilly · Pharma · Indianapolis, IN

The Principal Platform Reliability Engineer designs, operates, and improves highly available, scalable, and fault-tolerant systems across cloud environments. The role focuses on establishing reliability standards, driving operational excellence, and enabling engineering teams. Key responsibilities include defining SLOs/SLIs, leading incident response, developing operational standards, implementing observability frameworks, building CI/CD pipelines, and ensuring security and compliance.

What you'd actually do

  1. Define and implement SLOs, SLIs, and reliability standards that establish a consistent foundation for platform health, driving resilience through capacity planning, failover design, and disaster recovery strategies
  2. Lead response for P1/P2 incidents, owning rapid mitigation and recovery while conducting thorough root cause analysis and implementing corrective actions that prevent recurrence
  3. Develop and maintain runbooks, playbooks, and operational standards that enable the broader engineering organization to respond effectively and consistently
  4. Implement and optimize observability frameworks spanning monitoring, logging, tracing, and alerting — improving system visibility and reducing alert noise through actionable, signal-driven insights
  5. Leverage platforms such as Splunk, Prometheus, CloudWatch, or equivalent tooling to ensure teams have the telemetry they need to detect, diagnose, and resolve issues proactively

Skills

Required

  • Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related technical field
  • 7+ years of hands-on experience with AWS
  • Extensive experience with Kubernetes and containerization technologies (Docker, EKS, etc.)
  • Experience operating production-grade distributed systems
  • Experience in incident management and on-call support models
  • Experience defining and managing SLOs, SLIs, and error budgets
  • Hands-on experience with observability tools such as Splunk and the LGTM stack (Loki, Grafana, Tempo, Mimir)
  • Experience building and maintaining CI/CD pipelines
  • Proficiency with Infrastructure as Code tools (Terraform, CloudFormation, etc.)
  • Experience with scripting in Python, Bash, or PowerShell
  • Experience with networking and cloud architecture fundamentals
  • Experience implementing security best practices in cloud environments
  • Experience troubleshooting complex system and performance issues

Nice to have

  • Experience with tools such as ArgoCD, GitHub Actions, or GitOps workflows
  • Familiarity with large-scale enterprise platforms and environments
  • Experience in regulated industries such as healthcare or pharma
  • Exposure to global support models and follow-the-sun operations
  • Strong written communication skills, including crafting incident updates, postmortems, and status summaries for mixed audiences

What the JD emphasized

  • 7+ years of hands-on experience with AWS
  • Experience in regulated industries such as healthcare or pharma