Lead Technology Specialist (Lead Site Reliability Engineer)

Caterpillar · Industrial · Bangalore, Karnataka +1

Lead Site Reliability Engineer responsible for the operational ownership of Kubernetes-based platform environments on-premises and in AWS. Focuses on provisioning, configuration, monitoring, incident response, and automation to ensure platform reliability and performance for Caterpillar's Autonomy & Autonomous Business Unit.

What you'd actually do

  1. Provision, configure, and maintain Kubernetes clusters on on‑premises infrastructure (bare metal or virtualized) and in AWS (e.g., EKS).
  2. Implement and manage Infrastructure as Code (IaC) and automated workflows for cluster creation, upgrades, and application deployments (e.g., Terraform, Ansible, Helm, Git‑based pipelines).
  3. Establish and operate comprehensive observability (metrics, logs, traces), including SLI/SLO definitions, alerting, dashboards, and runbooks for platform and key services.
  4. Monitor environment health (control plane and node components), capacity, performance, and cost; perform tuning and right‑sizing across on‑prem and cloud.
  5. Execute bug triage: reproduce issues, collect diagnostics, perform root‑cause analysis, and coordinate fixes with platform/application teams and vendors.

Skills

Required

  • Kubernetes administration and operations in on‑premises and AWS environments (cluster lifecycle, upgrades, node management, workload scheduling).
  • Infrastructure as Code, automation, and Git‑based CI/CD.
  • Observability stacks and tooling (e.g., Prometheus, Grafana, Alertmanager, OpenTelemetry; ELK/Loki‑class logging).
  • Linux systems administration (container runtime, networking, storage).
  • Networking fundamentals applied to Kubernetes (CNI, DNS, Ingress/Load Balancing, TLS/cert management, basic L3/L4 concepts).
  • Security best practices (RBAC, pod security standards, network policies, image scanning, secrets management).
  • Experience with incident response, on‑call participation, and root‑cause analysis in production environments.
  • Strong documentation and communication skills; ability to work effectively with geographically distributed teams.

Nice to have

  • Experience with service mesh (e.g., Istio/Linkerd) and advanced container networking (e.g., eBPF‑based data paths, network policy engines).
  • Familiarity with backup/DR tooling for Kubernetes (e.g., Velero) and stateful workload recovery.
  • Exposure to Operational Technology (OT) or edge/remote site constraints and ruggedized deployments.
  • Experience with configuration compliance, policy‑as‑code (e.g., Open Policy Agent), and supply‑chain security.
  • Knowledge of platform registry operations, image lifecycle, and vulnerability management.