Lead Software Engineer - Site Reliability

Freshworks · Enterprise · Chennai, India · Other

Lead Site Reliability Engineer (SRE) role at Freshworks, focused on designing for resilience, automating recovery, and ensuring system stability, performance, and observability at scale. The role involves partnering with engineering, platform, and product teams to improve reliability, define SLIs/SLOs, manage error budgets, lead incident response, and champion observability across services. It requires strong coding proficiency, Linux expertise, and experience with Docker, Kubernetes, CI/CD, IaC, and distributed systems design, with a focus on high availability and scalability.

What you'd actually do

  1. Design and implement tools to improve availability, latency, scalability, and system health.
  2. Define SLIs/SLOs, manage error budgets, and drive performance engineering efforts.
  3. Build and maintain automated monitoring, alerting, and remediation pipelines.
  4. Collaborate with engineering teams to improve reliability by design.
  5. Lead incident response, root cause analysis, and blameless postmortems.

Skills

Required

  • SRE
  • DevOps
  • Production Engineering
  • Coding
  • Linux
  • Docker
  • Kubernetes
  • CI/CD
  • Infrastructure as Code (IaC)
  • Disaster Recovery (DR)
  • High Availability (HA)
  • Monitoring
  • Logging
  • Tracing
  • Distributed Systems Design
  • Troubleshooting

Nice to have

  • Computer Science Degree
  • Engineering Degree
  • Chaos Engineering

What the JD emphasized

  • 7–12 years of experience in SRE, DevOps, or Production Engineering roles
  • Coding Proficiency
  • Linux Expertise
  • Containerization & Orchestration
  • CI/CD Management
  • Security & Compliance
  • High Availability & Scalability
  • Infrastructure as Code (IaC) & Automation
  • Disaster Recovery (DR) & High Availability (HA)
  • Observability
  • System Design (Distributed Systems)
  • Problem-Solving & Troubleshooting
  • Building and scaling services in production with high uptime targets (99.99%+)
  • Reducing incident frequency and improving response metrics (MTTD/MTTR)