We are looking for a highly skilled Engineer – AIOps Engineering to design, build, and scale the next generation of intelligent operational platforms for our ecosystem. This role sits at the intersection of machine learning, LLMs, observability, automation, and service reliability, enabling predictive and autonomous operations across a globally distributed environment. In this role, you will architect and implement AIOps capabilities such as intelligent incident routing, anomaly detection, operational copilots, ChatOps workflows, and automated remediation. You will partner closely with SRE, platform engineering, service management, and product teams to embed intelligence into operational workflows and redefine how digital operations are run.
This is a highly technical, hands-on role that requires strong depth in applied ML/LLMs, operational systems, automation frameworks, and observability data structures.
AIOps Platform & Intelligence Development
- Design and build AIOps models using LLMs or classical ML for anomaly detection, correlation, root-cause identification, and intelligent event clustering.
- Develop operational copilots and chatbots capable of responding to incidents, surfacing insights, and driving automation through natural language.
- Build and maintain feature pipelines using telemetry, logs, metrics, traces, and runtime state for operational intelligence use cases.
- Implement predictive and preventive operations use cases, including capacity forecasting, early warning systems, and noisy-neighbor detection.
LLM Engineering & Applied AI
- Build knowledge-grounding systems for operational copilots using runbooks, incident data, historical patterns, service maps, and topology.
- Integrate LLM-based reasoning into observability and automation platforms.
- Develop embeddings, retrieval systems, RAG pipelines, and intent classification capabilities for operational queries.
Automation & Intelligent Remediation
- Build automated workflows for incident triage, diagnostics, collaboration, and remediation.
- Architect closed-loop automation patterns connecting alerts, insights, actions, and verification.
- Develop reusable automation modules integrated with unified observability platforms, cloud platforms, and orchestration systems.
Data, Observability & Integration
- Integrate AIOps models with observability platforms handling logs, metrics, traces, events, and topology data.
- Design real-time inference systems for high-volume telemetry streams.
- Partner with SRE and platform teams to ensure pipelines, data contracts, and instrumentation support future AIOps workloads.
Operational Excellence & Collaboration
- Work with transformation teams to define AIOps onboarding patterns, enablement models, and implementation guidelines.
- Drive AIOps adoption across multiple products and platforms, ensuring reliability, scalability, and continuous improvement.
- Participate in architecture reviews, data modeling discussions, and SRE transformation initiatives.
Qualifications
- 5+ years of experience in system design, platform and reliability engineering, ML engineering, data engineering, or AIOps-oriented roles.
- Strong hands-on experience building ML or LLM-based systems using Python, Java, PyTorch, TensorFlow, or modern LLM frameworks.
- Experience building automation workflows using tools such as StackStorm, Rundeck, Airflow, Jenkins, or cloud-native orchestration platforms.
- Deep understanding of observability data, including logs, metrics, and traces.
- Experience with observability platforms such as Datadog, Splunk, Prometheus, Grafana, or ELK.
- Experience designing and deploying RAG pipelines, embeddings, intent models, or operational chatbots.
- Strong experience architecting streaming or event-driven systems using platforms such as Kafka, Kinesis, or Pub/Sub.
- Familiarity with cloud-native systems, Kubernetes, microservices, and modern deployment patterns.
- Excellent problem-solving skills with the ability to translate operational challenges into ML-based or automation-based solutions.
- Ability to collaborate effectively across SRE, platform, service management, and engineering teams.
Must-Have Skills
- Hands-on experience with at least one of the following tools: Claude Code, Codex, or GitHub Copilot.
- Good understanding of context engineering.
- Understanding of agentic harness frameworks.
- Experience building at least one MCP server.
Career Level - IC4