Lead Technology Specialist(lead Site Re… at Caterpillar

What you'd actually do

Provision, configure, and maintain Kubernetes clusters on on‑premises infrastructure (bare metal or virtualized) and in AWS (e.g., EKS).

Implement and manage Infrastructure as Code (IaC) and automated workflows for cluster creation, upgrades, and application deployments (e.g., Terraform, Ansible, Helm, Git‑based pipelines).

Establish and operate comprehensive observability (metrics, logs, traces), including SLI/SLO definitions, alerting, dashboards, and runbooks for platform and key services.

Monitor environment health (control plane and node components), capacity, performance, and cost; perform tuning and right‑sizing across on‑prem and cloud.

Execute bug triage: reproduce issues, collect diagnostics, perform root‑cause analysis, and coordinate fixes with platform/application teams and vendors.

Skills

Required

Kubernetes administration and operations on on‑premises and AWS environments (cluster lifecycle, upgrades, node management, workload scheduling).
Infrastructure as Code, automation, and Git‑based CI/CD.
Observability stacks and tooling (e.g., Prometheus, Grafana, Alertmanager, OpenTelemetry; ELK/Loki‑class logging).
Linux systems administration (container runtime, networking, storage).
Networking fundamentals applied to Kubernetes (CNI, DNS, Ingress/Load Balancing, TLS/cert management, basic L3/L4 concepts).
Security best practices (RBAC, pod security standards, network policies, image scanning, secrets management).
Experience with incident response, on‑call participation, and root‑cause analysis in production environments.
Strong documentation and communication skills; ability to work effectively with geographically distributed teams.

Nice to have

Experience with service mesh (e.g., Istio/Linkerd) and advanced container networking (e.g., eBPF‑based data paths, network policy engines).
Familiarity with backup/DR tooling for Kubernetes (e.g., Velero) and stateful workload recovery.
Exposure to Operational Technology (OT) or edge/remote site constraints and ruggedized deployments.
Experience with configuration compliance, policy‑as‑code (e.g., Open Policy Agent), and supply‑chain security.
Knowledge of platform registry operations, image lifecycle, and vulnerability management.

Career Area:

Technology, Digital and Data

Job Description:

**Your Work Shapes the World at Caterpillar Inc.**

When you join Caterpillar, you're joining a global team who cares not just about the work we do – but also about each other. We are the makers, problem solvers, and future world builders who are creating stronger, more sustainable communities. We don't just talk about progress and innovation here – we make it happen, with our customers, where we work and live. Together, we are building a better world, so we can all enjoy living in it.

Job Summary

We are seeking a skilled Lead Technology Specialist (**Lead Site Reliability Engineer**) to join the Cat Technology GCIO IT Division.

Come work on the Caterpillar IT Team as a **Lead Technology Specialist** supporting Caterpillar's Autonomy & Automation Business Unit.

The Autonomy and Automation team is focused on scaling technology solutions in mining, construction, quarry and aggregates, and beyond to support customer safety and productivity goals. A&A is responsible for technology solutions including autonomy, semi-autonomy, remote control, and other technologies. The goal is to address key customer problems, including safety, productivity, labor shortage, energy transition, and process optimization. In this role as Lead Site Reliability Engineer, you will provide end-to-end operational ownership of Kubernetes-based platform environments deployed on on-premises hardware and in AWS. You will ensure reliable provisioning, configuration, monitoring, and continuous improvement of clusters and workloads; perform bug triage and incident response; drive observability and automation; and partner with platform, networking, and application teams to meet reliability objectives and business needs.

The preferred location for this role is the Whitefield PSN Office, Bangalore, KA, or the Chennai WTC Centre, TN, India.

What you will do

Provision, configure, and maintain Kubernetes clusters on on‑premises infrastructure (bare metal or virtualized) and in AWS (e.g., EKS).
Implement and manage Infrastructure as Code (IaC) and automated workflows for cluster creation, upgrades, and application deployments (e.g., Terraform, Ansible, Helm, Git‑based pipelines).
Establish and operate comprehensive observability (metrics, logs, traces), including SLI/SLO definitions, alerting, dashboards, and runbooks for platform and key services.
Monitor environment health (control plane and node components), capacity, performance, and cost; perform tuning and right‑sizing across on‑prem and cloud.
Execute bug triage: reproduce issues, collect diagnostics, perform root‑cause analysis, and coordinate fixes with platform/application teams and vendors.
Lead incident response for reliability events (degradations, outages), post‑incident reviews, and preventive actions.
Administer Kubernetes security controls (RBAC, network policies, secrets management, image signing/scanning), certificate management, and compliance control implementation.
Manage platform services (container registry, ingress/controllers, CNI, storage classes/CSI, service mesh where applicable).
Implement backup/restore and disaster recovery strategies for clusters and stateful workloads (e.g., Velero), and validate them regularly.
Maintain and improve CI/CD workflows integrating testing, policy checks, and progressive delivery for platform and shared services.
Create and maintain operational documentation: standards, diagrams, runbooks, automation playbooks, and knowledge base articles.
Collaborate with networking, security, and application teams to ensure reliability, performance, and secure connectivity across data centers and AWS.
Drive continuous improvement: reliability engineering practices, toil reduction, automation, and change management processes.

**What you will have**

Kubernetes administration and operations on on‑premises and AWS environments (cluster lifecycle, upgrades, node management, workload scheduling).
Infrastructure as Code, automation, and Git‑based CI/CD.
Observability stacks and tooling (e.g., Prometheus, Grafana, Alertmanager, OpenTelemetry; ELK/Loki‑class logging).
Linux systems administration (container runtime, networking, storage).
Networking fundamentals applied to Kubernetes (CNI, DNS, Ingress/Load Balancing, TLS/cert management, basic L3/L4 concepts).
Security best practices (RBAC, pod security standards, network policies, image scanning, secrets management).
Experience with incident response, on‑call participation, and root‑cause analysis in production environments.
Strong documentation and communication skills; ability to work effectively with geographically distributed teams.

Top Candidates Will Also Have:

Experience with service mesh (e.g., Istio/Linkerd) and advanced container networking (e.g., eBPF‑based data paths, network policy engines).
Familiarity with backup/DR tooling for Kubernetes (e.g., Velero) and stateful workload recovery.
Exposure to Operational Technology (OT) or edge/remote site constraints and ruggedized deployments.
Experience with configuration compliance, policy‑as‑code (e.g., Open Policy Agent), and supply‑chain security.
Knowledge of platform registry operations, image lifecycle, and vulnerability management.
This position requires the candidate to work a 5-day-a-week schedule in the office.

Skills desired:

Technical Excellence: Knowledge of a given technology and various application methods; ability to develop and provide solutions to significant technical challenges.
Level Extensive Experience:
• Advises others on the assessment and provision of all technical solutions.
• Engages appropriate subject matter resources to effectively resolve technical issues.
• Mentors others to enhance their technical competence and its application to achieve more effective technical solutions.
• Coaches others in promoting, defining, analyzing, and providing superior technical solutions to business problems.
• Provides effective solutions to moderate technical challenges through strong technical competence, effectively examining implications of events and issues.
• Assumes accountability for personal technical performance and holds others responsible for theirs.

Technology Advising: Knowledge of effective advisory methods and ability to provide valued information and advice to clients regarding products, technologies, services and solutions for a specific technology domain.
Level Working Knowledge:
• Assesses the current technology environment, expressed needs and initiatives of client organizations.
• Uses an effective consulting method to present technology solutions that resolve stated client business issues.
• Advises clients regarding a family of specific products, technologies or services in a technology domain.
• Demonstrates basic competence and sound business knowledge regarding specific products, technologies or services within a domain of technology expertise.
• Achieves consulting relationship rating of 'professional' by delivering timely, meaningful advice meeting client needs in a narrow set of specific technologies.

Hardware Infrastructure: Knowledge of computer architecture and systems programming; ability to design, build and integrate IT hardware into multi-platforms for the organization.
Level Extensive Experience:
• Evaluates IT hardware vendors in the market and selects the most suitable products for the organization.
• Guides employees on the integration of IT hardware throughout other organization-wide platforms.
• Supervises the implementation process of IT hardware ensuring consistency in productivity and overall effectiveness.
• Advises others on business standards and practices for IT hardware in order to meet designer requirements.
• Evaluates the advantages and disadvantages of an organization's hardware components.
• Diagnoses IT hardware problems and recommends dynamic solutions.

Requirements Analysis: Knowledge of tools, methods, and techniques of requirement analysis; ability to elicit, analyze and record required business functionality and non-functionality requirements to ensure the success of a system or software development project.
Level Working Knowledge:
• Follows policies, practices and standards for determining functional and informational requirements.
• Confirms deliverables associated with requirements analysis.
• Communicates with customers and users to elicit and gather client requirements.
• Participates in the preparation of detailed documentation and requirements.
• Utilizes specific organizational methods, tools and techniques for requirements analysis.

System Testing: Knowledge of system and software testing; ability to design, plan and execute system testing strategies and tactics to ensure the quality of software at all stages of the system life cycle.
Level Extensive Experience:
• Verifies the proper flow of transactions across all input, output and storage channels or devices.
• Evaluates interoperability of new systems with existing systems during the beta testing phase.
• Supervises the testing of complex, multi-platform and distributed applications.
• Designs processes to ensure that the system meets and maintains requirements and expectations.
• Coaches end users on the development of test data and test scenarios for system validation.
• Manages the execution of test plans, including resources, strategies, schedules, processes and tools.

Systems Software Infrastructure: Knowledge of computer architecture and system software interaction; ability to design and build a fundamental architecture of operating systems, database management systems, communications protocols, compilers and other development tools.
Level Working Knowledge:
• Reports software connectivity and integration issues.
• Demonstrates planned software changes on the local environment.
• Administers software migration and contingency plans related to own function.
• Analyzes the local software architecture components and products.
• Tests key features for the entire software infrastructure environment.

Technical Troubleshooting: Knowledge of technical troubleshooting approaches, tools and techniques; ability to anticipate, recognize, and resolve technical issues on hardware, software, application or operation.
Level Extensive Experience:
• Emphasizes the business impact of failure and the criticality and timing of needed resolution so that problems can be avoided in the future.
• Creates trouble reports for all issues found and reviews solutions for completeness and correctness.
• Directs the resolution of communications problems in multi-vendor environments.
• Resolves a variety of hardware, software, and communications malfunctions.
• Coaches others on advanced diagnostic techniques and tools for unusual or performance-related problems.
• Facilitates the distribution of releases reports and correction packages to departments or clients.

Technical Writing/Documentation: Knowledge of technical writing; ability to write technical documents such as manuals, reports, guidelines or documents on standards, processes and applications.
Level Extensive Experience:
• Conducts training on alternative documentation delivery mechanisms, tools and techniques.
• Manages cost items in producing and maintaining documentation.
• Designs and implements formal methodologies for producing documentation.
• Collaborates with support function managers, the product management team, and design engineers with writing projects.
• Supervises the analysis, design and data collation on large documentation initiatives.
• Establishes and references best practices for existing and planned tools and delivery vehicles for proper documentation.

What you will get:

Work Life Harmony
Earned and medical leave.
Relocation assistance

Holistic Development

Personal and professional development through Caterpillar's employee resource groups across the globe
Career development opportunities with global prospects

Health and Wellness

Medical coverage: medical, life, and personal accident coverage
Employee mental wellness assistance program

Financial Wellness

Employee investment plan
Pay for performance: annual incentive bonus plan

Additional Information:

Caterpillar is not currently hiring individuals for this position who now or in the future require sponsorship for employment visa status; however, as a global company, Caterpillar offers many job opportunities outside of the U.S. which can be found through our employment website at www.caterpillar.com/careers

This position requires working onsite five days a week.

Visa Sponsorship is not available for this position.

Posting Dates:

May 13, 2026 - May 19, 2026

Caterpillar is an Equal Opportunity Employer. Qualified applicants of any age are encouraged to apply.
