AI Operations Engineer 2

Expedia · Hospitality · Prague, Czech Republic

This AI Operations Engineer role focuses on operating, scaling, and improving AI-enabled services and platforms in a cloud-first AWS environment. It emphasizes operational ownership, infrastructure as code, automation, and safe integration of AI-powered capabilities into real-world systems, rather than building models. Responsibilities include operating and monitoring AI-driven production systems, implementing automation, collaborating with ML teams, applying system design principles, and ensuring quality, compliance, and security through controls and guardrails.

What you'd actually do

  1. You will operate, monitor, and optimize AI‑driven production systems and related services to ensure reliability, availability, and performance within defined SLAs.
  2. You will implement and maintain automation, tooling, and runbooks that simplify AI system operations, including deployment, change management, incident response, and recovery.
  3. You will collaborate with software engineers, data and ML teams, and product partners to support AI workloads in production, including model‑serving infrastructure, APIs, and data pipelines.
  4. You will apply system design, API design, and data modeling principles to improve the robustness, observability, and maintainability of AI‑related services and platforms.
  5. You will safely integrate and operate AI/ML‑enabled solutions that improve outcomes, ensuring appropriate controls, monitoring, and guardrails for quality, compliance, and security.

Skills

Required

  • Relevant technical degree or equivalent practical experience
  • Professional experience operating or supporting production services or platforms
  • Strong background in DevOps, SRE, or automation-focused operations
  • Proven experience automating operational processes and managing infrastructure via code
  • Solid scripting or programming skills, preferably Python
  • Strong understanding of cloud infrastructure, with AWS experience required
  • Practical exposure to AI‑driven systems, tools, or workflows in production

Nice to have

  • Experience operating AI or ML workloads at scale
  • Demonstrated ability to design and implement robust operational architectures for AI‑driven services
  • Proven track record of driving operational excellence for complex, business‑critical systems
  • Familiarity with AI‑driven systems, tools, or workflows and applying AI/ML concepts to real‑world products
  • Experience collaborating with engineering and data/ML teams to influence system design, APIs, and data models

What the JD emphasized

  • AWS experience required
  • operating AI or ML workloads at scale
  • model deployment, versioning, rollout strategies, and monitoring model performance and system behavior in production
  • design and implement robust operational architectures for AI-driven services, including observability, resilience patterns, and capacity planning
  • driving operational excellence for complex, business‑critical systems
  • safely integrating and operating AI/ML‑enabled solutions

Other signals

  • operating AI-enabled services
  • DevOps, SRE, and platform engineering principles
  • automation, tooling, and runbooks
  • support AI workloads in production
  • model-serving infrastructure
  • API design
  • data modeling
  • guardrails for quality, compliance, and security
  • operational excellence
  • reliability
  • observability