Senior Manager, Software Engineering - DataOps/MLOps/AIOps/DevOps

Caterpillar · Industrial · Chennai, Tamil Nadu

This is a Senior Engineering Manager role leading DataOps, MLOps, AIOps, and DevOps capabilities, with a focus on operationalizing data and AI at enterprise scale. The role drives the design, development, deployment, and intelligent operation of data platforms, ML systems, and cloud-native infrastructure, and is accountable for their reliability, performance, security, and continuous optimization. Key responsibilities include setting strategic direction and platform strategy, managing the ML lifecycle, running CI/CD, applying SRE principles, adopting AIOps for proactive monitoring and remediation, and ensuring governance, security, and compliance.

What you'd actually do

  1. Provide strategic direction and technical leadership across Data Ops, ML Ops, DevOps, and AI Ops, fostering a culture of engineering excellence, automation, and operational rigor.
  2. Define and execute the end-to-end platform strategy spanning data pipelines, ML lifecycle, CI/CD, infrastructure, and intelligent operations.
  3. Architect and scale cloud-native data platforms supporting real-time and batch ingestion, transformation, analytics, and AI workloads.
  4. Drive ML Ops best practices for model training, deployment, monitoring, retraining, and governance across the full model lifecycle.
  5. Lead the adoption of AI Ops capabilities for proactive monitoring, anomaly detection, incident correlation, root cause analysis, and predictive remediation.

Skills

Required

  • 15+ years of experience in software, data, or platform engineering
  • 5+ years in senior engineering leadership roles
  • Strong expertise across Data Engineering, ML Ops, DevOps, and production platform operations
  • Hands-on experience with cloud platforms (AWS, Azure, or GCP)
  • Hands-on experience with containers and container orchestration (Docker, Kubernetes)
  • Proven experience with CI/CD pipelines
  • Proven experience with infrastructure-as-code (Terraform, ARM templates, CloudFormation)
  • Proven experience with automation frameworks
  • Solid understanding of streaming and data platforms (Kafka, Spark, Flink)
  • Solid understanding of ML Ops tooling (MLflow, Kubeflow, SageMaker)
  • Experience driving platform reliability, security, governance, and compliance at enterprise scale
  • Strong leadership, communication, and stakeholder management skills

Nice to have

  • Experience with AI Ops platforms, intelligent observability, and incident automation
  • Exposure to feature stores, model registries, real-time inference, and event-driven architectures
  • Knowledge of SRE practices, error budgets, and resilience engineering
  • Familiarity with GPU acceleration, distributed training, and high-performance computing
  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry)
  • Experience with log analytics platforms
  • Contributions to open-source projects or published work in data platforms, ML Ops, DevOps, or AI Ops

What the JD emphasized

  • operationalizing data and AI at scale
  • AI Ops capabilities
  • ML Ops best practices
  • end-to-end platform strategy
  • AI, DevOps, and cloud platforms
