Principal Site Reliability Engineer

at UiPath · Enterprise · Bangalore, India · Engineering

UiPath is seeking a Principal Site Reliability Engineer to build intelligent reliability platforms and tooling that leverage AI/ML to improve service reliability, reduce operational toil, and accelerate incident response. The role involves designing self-healing mechanisms, AI-assisted debugging tools, predictive reliability models, and AI-powered incident response systems for large-scale, cloud-native systems.

What you'd actually do

  1. Design and implement self-healing mechanisms including automated remediation workflows and intelligent retry and fallback strategies.
  2. Build internal systems that enable engineering teams to debug faster using AI-assisted tooling and proactively identify and mitigate reliability risks.
  3. Define and evolve reliability strategy using predictive reliability models (capacity, failure forecasting, reliability scoring) and embed intelligent reliability practices across the engineering teams.
  4. Build AI-powered systems that determine impact and use historical data to improve detection and response over time.
  5. Influence standards for building AI-driven tooling, mentor junior and senior engineers, and elevate reliability focus across the organization.

Skills

Required

  • 7+ years of experience in SRE, platform, or cloud infrastructure engineering roles
  • Strong conceptual understanding of distributed systems, performance bottlenecks, failure modes, and trade-offs inherent to large-scale systems.
  • Experience building applications or internal tools using LLMs to automate non-trivial workflows
  • Hands-on experience building Agents/Copilots using modern ML frameworks (PyTorch, vLLM, or equivalent) in a production setting.
  • Proficiency in at least one programming language (e.g., Python, Go, or similar).
  • Experience with Infrastructure as Code (e.g., Terraform, Pulumi)
  • Experience with container orchestration (e.g., Kubernetes).
  • Hands-on experience working with one or more major cloud providers (Azure, AWS, GCP)
  • Proven experience with monitoring/observability stacks (metrics, logs, traces)
  • Experience participating in and improving incident response, blameless postmortems, and implementing systemic fixes rather than symptomatic patches.
  • Ability to partner with product, infrastructure, and engineering teams to influence architecture and reliability practices without direct authority.

Nice to have

  • Practical knowledge of networking, deployments, and scaling.
  • Experience building meaningful dashboards and alerts that improve reliability signals.

What the JD emphasized

  • AI/ML
  • AI-driven tooling
  • LLMs
  • Agents/Copilots
  • reliability

Other signals

  • building intelligent reliability platforms
  • leverage AI/ML to improve reliability
  • reduce operational toil
  • accelerate incident response
  • predictive reliability
  • self-healing capabilities
  • AI-assisted tooling
  • LLMs to automate non-trivial workflows
  • Agents/Copilots using modern ML frameworks

Life at UiPath

The people at UiPath believe in the transformative power of automation to change how the world works. We’re committed to creating category-leading enterprise software that unleashes that power.

To make that happen, we need people who are curious, self-propelled, generous, and genuine. People who love being part of a fast-moving, fast-thinking growth company. And people who care—about each other, about UiPath, and about our larger purpose.

Could that be you?

Your mission

UiPath is seeking a Principal Site Reliability Engineer to redefine how reliability is engineered using AI. This role focuses on building intelligent reliability platforms and tooling that leverage AI/ML to improve reliability of our services, reduce operational toil for developers, and accelerate incident response across large-scale, cloud-native systems.

You will operate at the intersection of SRE, distributed systems, and applied AI, designing systems that transform raw telemetry into actionable insights, enable predictive reliability, and introduce self-healing capabilities into production environments.

You will build the next generation of reliability systems, where detection, diagnosis, and remediation are increasingly automated and data-driven.

You will help define how reliability is architected, scaled, measured, and automated across our large-scale, cloud-native systems. This role requires broad technical judgment, platform thinking, and the ability to influence reliability outcomes across engineering and platform teams.

What you'll do at UiPath

**Intelligent automation and self-healing systems** - Design and implement self-healing mechanisms including automated remediation workflows and intelligent retry and fallback strategies.

**Reliability platform tooling** - Build internal systems that enable engineering teams to debug faster using AI-assisted tooling and proactively identify and mitigate reliability risks.

**End-to-end reliability strategy** - Define and evolve reliability strategy using predictive reliability models (capacity, failure forecasting, reliability scoring) and embed intelligent reliability practices across the engineering teams.

**AI-assisted incident response & RCA** - Build AI-powered systems that determine impact and use historical data to improve detection and response over time.

**Technical leadership & org impact** - Influence standards for building AI-driven tooling, mentor junior and senior engineers, and elevate reliability focus across the organization.

What you'll bring to the team

Engineering & Reliability Experience

• 7+ years of experience in SRE, platform, or cloud infrastructure engineering roles, with a track record of building internal tooling to improve reliability.

• Strong conceptual understanding of distributed systems, performance bottlenecks, failure modes, and trade-offs inherent to large-scale systems.

AI/ML Application to systems & operations

• Experience building applications or internal tools using LLMs to automate non-trivial workflows (e.g., AIOps, automated code reviews, automated flagging of reliability risks).

• Hands-on experience building Agents/Copilots using modern ML frameworks (PyTorch, vLLM, or equivalent) in a production setting.

Scripting & Tooling

• Proficiency in at least one programming language (e.g., Python, Go, or similar). Experience with Infrastructure as Code (e.g., Terraform, Pulumi) and container orchestration (e.g., Kubernetes).

Cloud & Infrastructure Expertise

• Hands-on experience working with one or more major cloud providers (Azure, AWS, GCP), with practical knowledge of networking, deployments, and scaling.

Observability & Operational Practices

• Proven experience with monitoring/observability stacks (metrics, logs, traces) and building meaningful dashboards and alerts that improve reliability signals.

Incident Response & Post-Incident Learning

• Experience participating in and improving incident response, blameless postmortems, and implementing systemic fixes rather than symptomatic patches.

Collaboration & Influence

• Ability to partner with product, infrastructure, and engineering teams to influence architecture and reliability practices without direct authority.

Maybe you don’t tick all the boxes above—but still think you’d be great for the job? Go ahead, apply anyway. Please. Because we know that experience comes in all shapes and sizes—and passion can’t be learned.

Many of our roles allow for flexibility in when and where work gets done. Depending on the needs of the business and the role, the number of hybrid, office-based, and remote workers will vary from team to team. Applications are assessed on a rolling basis and there is no fixed deadline for this requisition. The application window may change depending on the volume of applications received or may close immediately if a qualified candidate is selected.

We value a range of diverse backgrounds, experiences, and ideas. We pride ourselves on our diverse and inclusive workplace that provides equal opportunities to all persons regardless of age, race, color, religion, sex, sexual orientation, gender identity and expression, national origin, disability, neurodiversity, military and/or veteran status, or any other protected class. Additionally, UiPath provides reasonable accommodations for candidates on request and respects applicants' privacy rights. To review these and other legal disclosures, visit our privacy policy.