Manager Digital Operations Support (Modern Manufacturing Digital Platform & Applications)

Caterpillar · Industrial · Chennai, Tamil Nadu +1

The Manager Digital Operations Support role for the Modern Manufacturing Digital Platform (MMDP) at Caterpillar leads 24x7 production support, platform operations, and DataOps functions to ensure high availability, reliability, performance, and data quality across manufacturing digital systems. The role covers incident and problem management, data pipeline operations, service reliability, team leadership, and continuous improvement through automation and AIOps.

What you'd actually do

  1. Lead and manage day-to-day production support across application, platform, and data layers
  2. Own end-to-end DataOps processes including data ingestion, transformation, pipelines, and data quality
  3. Act as primary escalation point for P1/P2 incidents (application, platform, and data)
  4. Define and implement monitoring and alerting strategies across applications and data pipelines
  5. Lead a distributed team of support engineers, SREs, and DataOps engineers

Skills

Required

  • Bachelor’s degree in Engineering, Computer Science, Data Engineering, or related field
  • 10–15+ years in IT operations, production support, SRE, or DataOps
  • 5+ years of people leadership experience
  • Incident and problem management
  • Data pipeline operations (ETL/ELT, streaming, batch)
  • Cloud platforms (Azure/AWS)

Nice to have

  • DataOps
  • Data platforms (e.g., Kafka, Spark, Data Lake)
  • ITSM tools (ServiceNow or equivalent)
  • Microservices and distributed systems
  • Data engineering and analytics pipelines
  • DevOps, AIOps, and automation

What the JD emphasized

  • day-to-day production support
  • SLA, SLO, and KPI metrics
  • rapid incident resolution
  • operational health, system performance, and availability
  • DataOps processes
  • availability, accuracy, and timeliness of manufacturing data pipelines
  • data-related incidents
  • data observability, lineage, and monitoring capabilities
  • data reliability (SLAs) and governance
  • P1/P2 incidents
  • root cause analysis (RCA)
  • post-incident review processes
  • MTTR through automation, predictive alerts, and proactive issue detection
  • monitoring and alerting strategies
  • end-to-end observability
  • system resilience, reliability, and automation
  • distributed team of support engineers, SREs, and DataOps engineers
  • performance, productivity, and cost optimisation
  • 24x7 global support model
  • automation, AIOps, and DataOps practices
  • ITSM frameworks
  • incident trends, backlog management, and workflow efficiency
  • Engineering, Product, Data, and Plant stakeholders
  • new platform rollouts, plant onboarding, and hypercare phases
  • application, data, and architecture teams
  • SLA compliance
  • Incident volumes and trends
  • Data pipeline success rates and data quality metrics
  • MTTR / MTBF
  • System and data availability
  • operational and data excellence