Lead Software Engineer

Honeywell · Industrial · India

Lead site reliability engineering (SRE) role focused on the reliability, scalability, and performance of production systems, including AI-enabled services. Responsibilities include defining SRE standards, building automation, managing cloud infrastructure, operationalizing ML workloads, managing incidents, and mentoring engineers. Requires expertise in cloud, observability, and CI/CD, plus familiarity with ML pipelines and MLOps tools.

What you'd actually do

  1. Define and enforce SRE standards, SLIs/SLOs, and error budgets across critical systems.
  2. Build and scale automation frameworks for deployment, monitoring, and incident response.
  3. Lead design and optimization of hybrid cloud infrastructure (Azure, GCP) with a focus on resilience and cost efficiency.
  4. Partner with engineering teams to operationalize ML workloads, strengthen MLOps pipelines, and ensure reliability of AI-driven services.
  5. Drive root cause analysis, postmortems, and continuous improvement for production incidents.

Skills

Required

  • SRE standards
  • SLIs/SLOs
  • error budgets
  • automation frameworks
  • deployment
  • monitoring
  • incident response
  • hybrid cloud infrastructure
  • resilience
  • cost efficiency
  • MLOps pipelines
  • reliability of AI-driven services
  • root cause analysis
  • postmortems
  • cloud architecture
  • containers
  • Kubernetes
  • serverless patterns
  • observability stacks
  • Prometheus
  • Grafana
  • ELK
  • OpenTelemetry
  • CI/CD tools
  • Terraform
  • Ansible
  • Jenkins
  • GitHub Actions
  • ML pipelines
  • MLOps tools
  • Azure ML
  • MLflow
  • Databricks
  • Python
  • Go
  • mentoring engineers
  • cross-functional partners
  • reliability culture
  • communication

Nice to have

  • AI-enabled services
  • intelligent validation

What the JD emphasized

  • AI-enabled services
  • operationalize ML workloads
  • MLOps pipelines
  • reliability of AI-driven services
  • observability, automation, and intelligent validation

Other signals

  • observability, automation, and intelligent validation into every stage of the lifecycle