Production Engineer/Site Reliability Engineer (Shift Basis)

Rubrik · Enterprise · Bangalore, India · Engineering

This role is for a Production Engineer/Site Reliability Engineer on a 24/7 team responsible for managing and supporting critical infrastructure and services in multi-cloud environments. Key responsibilities include overseeing staging and production environments, implementing observability solutions, leading incident management, analyzing incidents to reduce toil and improve resilience, and designing automation tools for issue detection and remediation. The role requires a solid understanding of distributed systems, production environments, Kubernetes, and infrastructure management tools (CloudFormation, Terraform), along with strong analytical skills, proficiency in Python, and knowledge of data structures, algorithms, UNIX, networking, operating systems, and databases.

What you'd actually do

  1. Join a 24/7 Production Operations team responsible for managing and supporting critical infrastructure and services in multi-cloud environments.
  2. Oversee staging and production environments to ensure maximum uptime and reliability.
  3. Implement and maintain comprehensive observability solutions for real-time monitoring, alerting, and metrics collection.
  4. Lead incident management efforts by swiftly responding to alerts and outages and coordinating teams to drive timely resolution.
  5. Analyze recurring incidents to identify root causes, reduce toil, and improve system resilience.

Skills

Required

  • distributed system concepts
  • production systems and environments
  • public cloud infrastructures
  • container orchestration platforms
  • Kubernetes
  • infrastructure management tools
  • CloudFormation
  • Terraform
  • analytical and problem-solving skills
  • diagnosing and resolving system and application issues
  • data structures
  • algorithms
  • UNIX
  • networking
  • operating systems
  • database systems
  • MySQL
  • Python programming skills
  • verbal and written communication skills