What you'd actually do

Design and build AIOps models using LLMs or classical ML for anomaly detection, correlation, root-cause identification, and intelligent event clustering.

Develop operational copilots and chatbots capable of responding to incidents, surfacing insights, and driving automation through natural language.

Build knowledge-grounding systems for operational copilots using runbooks, incident data, historical patterns, service maps, and topology.

Build automated workflows for incident triage, diagnostics, collaboration, and remediation.

Integrate AIOps models with observability platforms handling logs, metrics, traces, events, and topology data.

Skills

Required

System Design
Platform & reliability engineering
ML engineering
Data engineering
AIOps
Python
Java
PyTorch
TensorFlow
Modern LLM frameworks
Automation workflows
StackStorm
Rundeck
Airflow
Jenkins
Cloud-native orchestration platforms
Observability data (logs, metrics, traces)
Observability platforms (Datadog, Splunk, Prometheus, Grafana, ELK)
RAG pipelines
Embeddings
Intent models
Operational chatbots
Streaming or event-driven systems (Kafka, Kinesis, Pub/Sub)
Cloud-native systems
Kubernetes
Microservices
Modern deployment patterns
Claude Code
Codex
GitHub Copilot
Context engineering
Agentic harness frameworks
MCP server

Nice to have

Translate operational challenges into ML-based or automation-based solutions
Collaborate effectively across SRE, platform, service management, and engineering teams

What the JD emphasized

highly technical, hands-on role

strong depth in applied ML/LLMs

strong hands-on experience building ML or LLM-based systems

Deep understanding of observability data

Experience designing and deploying RAG pipelines, embeddings, intent models, or operational chatbots

Strong experience architecting streaming or event-driven systems

Excellent problem-solving skills

Hands-on experience with at least one of the following tools: Claude Code, Codex, GitHub Copilot

Good understanding of context engineering

Understanding of agentic harness frameworks

Experience building at least one MCP server

We are looking for a highly skilled Engineer – AIOps Engineering to design, build, and scale the next generation of intelligent operational platforms for our ecosystem. This role sits at the intersection of machine learning, LLMs, observability, automation, and service reliability, enabling predictive and autonomous operations across a globally distributed environment. In this role, you will architect and implement AIOps capabilities such as intelligent incident routing, anomaly detection, operational copilots, ChatOps workflows, and automated remediation. You will partner closely with SRE, platform engineering, service management, and product teams to embed intelligence into operational workflows and redefine how digital operations are run.

This is a highly technical, hands-on role that requires strong depth in applied ML/LLMs, operational systems, automation frameworks, and observability data structures.

AIOps Platform & Intelligence Development

Design and build AIOps models using LLMs or classical ML for anomaly detection, correlation, root-cause identification, and intelligent event clustering.
Develop operational copilots and chatbots capable of responding to incidents, surfacing insights, and driving automation through natural language.
Build and maintain feature pipelines using telemetry, logs, metrics, traces, and runtime state for operational intelligence use cases.
Implement predictive and preventive operations use cases, including capacity forecasting, early warning systems, and noisy-neighbor detection.

LLM Engineering & Applied AI

Build knowledge-grounding systems for operational copilots using runbooks, incident data, historical patterns, service maps, and topology.
Integrate LLM-based reasoning into observability and automation platforms.
Develop embeddings, retrieval systems, RAG pipelines, and intent classification capabilities for operational queries.

Automation & Intelligent Remediation

Build automated workflows for incident triage, diagnostics, collaboration, and remediation.
Architect closed-loop automation patterns connecting alerts, insights, actions, and verification.
Develop reusable automation modules integrated with unified observability platforms, cloud platforms, and orchestration systems.

Data, Observability & Integration

Integrate AIOps models with observability platforms handling logs, metrics, traces, events, and topology data.
Design real-time inference systems for high-volume telemetry streams.
Partner with SRE and platform teams to ensure pipelines, data contracts, and instrumentation support future AIOps workloads.

Operational Excellence & Collaboration

Work with transformation teams to define AIOps onboarding patterns, enablement models, and implementation guidelines.
Drive AIOps adoption across multiple products and platforms, ensuring reliability, scalability, and continuous improvement.
Participate in architecture reviews, data modeling discussions, and SRE transformation initiatives.

Qualifications

5+ years of experience in system design, platform & reliability engineering, ML engineering, data engineering, or AIOps-oriented roles.
Strong hands-on experience building ML or LLM-based systems using Python, Java, PyTorch, TensorFlow, or modern LLM frameworks.
Experience building automation workflows using tools such as StackStorm, Rundeck, Airflow, Jenkins, or cloud-native orchestration platforms.
Deep understanding of observability data, including logs, metrics, and traces.
Experience with observability platforms such as Datadog, Splunk, Prometheus, Grafana, or ELK.
Experience designing and deploying RAG pipelines, embeddings, intent models, or operational chatbots.
Strong experience architecting streaming or event-driven systems such as Kafka, Kinesis, or Pub/Sub.
Familiarity with cloud-native systems, Kubernetes, microservices, and modern deployment patterns.
Excellent problem-solving skills with the ability to translate operational challenges into ML-based or automation-based solutions.
Ability to collaborate effectively across SRE, platform, service management, and engineering teams.

Must-Have Skills

Hands-on experience with at least one of the following tools: Claude Code, Codex, GitHub Copilot.
Good understanding of context engineering.
Understanding of agentic harness frameworks.
Experience building at least one MCP server.

Career Level - IC4