Lead Software Engr at Honeywell

What you'd actually do

Define and enforce SRE standards, SLIs/SLOs, and error budgets across critical systems.

Build and scale automation frameworks for deployment, monitoring, and incident response.

Lead design and optimization of hybrid cloud infrastructure (Azure, GCP) with a focus on resilience and cost efficiency.

Partner with engineering teams to operationalize ML workloads, strengthen MLOps pipelines, and ensure reliability of AI‑driven services.

Drive root cause analysis, postmortems, and continuous improvement for production incidents.

Skills

Required

SRE standards
SLIs/SLOs
error budgets
automation frameworks
deployment
monitoring
incident response
hybrid cloud infrastructure
resilience
cost efficiency
MLOps pipelines
reliability of AI-driven services
root cause analysis
postmortems
cloud architecture
containers
Kubernetes
serverless patterns
observability stacks
Prometheus
Grafana
ELK
OpenTelemetry
CI/CD tools
Terraform
Ansible
Jenkins
GitHub Actions
ML pipelines
MLOps tools
Azure ML
MLflow
Databricks
Python
Go
mentoring engineers
cross-functional partners
reliability culture
communication

Nice to have

AI-enabled services
intelligent validation

The Lead Site Reliability Engineer (Lead SRE) is responsible for driving reliability, scalability, and performance across Honeywell’s production systems. This role bridges software engineering and operations, ensuring that cloud‑native platforms and AI‑enabled services are resilient, secure, and cost‑optimized. The Lead SRE will mentor engineers, establish reliability best practices, and partner with product and engineering teams to embed observability, automation, and intelligent validation into every stage of the lifecycle.

Reliability Strategy & Leadership: Define and enforce SRE standards, SLIs/SLOs, and error budgets across critical systems.
Automation & Tooling: Build and scale automation frameworks for deployment, monitoring, and incident response.
Cloud & Infrastructure: Lead design and optimization of hybrid cloud infrastructure (Azure, GCP) with a focus on resilience and cost efficiency.
AI/ML Readiness: Partner with engineering teams to operationalize ML workloads, strengthen MLOps pipelines, and ensure reliability of AI‑driven services.
Incident Management: Drive root cause analysis, postmortems, and continuous improvement for production incidents.
Mentorship & Collaboration: Guide SRE and engineering teams, fostering a culture of ownership, learning, and proactive reliability practices.
Governance & Security: Ensure compliance, observability, and responsible use of automation and AI in production systems.
Education: Bachelor’s or Master’s in Computer Science, Engineering, or a related field.
Experience: 12+ years in software engineering or operations, with 3–5 years in SRE leadership. Proven experience managing large‑scale distributed systems and cloud infrastructure.
Technical Skills:
- Expertise in cloud architecture, containers, Kubernetes, and serverless patterns.
- Strong knowledge of observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).
- Proficiency in automation and CI/CD tools (Terraform, Ansible, Jenkins, GitHub Actions).
- Familiarity with ML pipelines and MLOps tools (Azure ML, MLflow, Databricks).
- Programming skills in Python or Go.
Leadership Skills: Ability to mentor engineers, influence cross‑functional partners, and drive reliability culture. Strong communicator with executive presence.
