AI Platform Reliability Engineer

Oracle · Enterprise · United States

Oracle is seeking an AI Platform Reliability Engineer to ensure the reliability, observability, and safety of AI agent platforms and AI-enabled analytics workflows in production. The role focuses on the operational foundation for AI systems (monitoring, tracing, evaluation, and quality safeguards) and on data reliability use cases.

What you'd actually do

  1. Build and maintain observability, logging, tracing, and monitoring for AI agents, agent tools, and AI-enabled analytics workflows.
  2. Implement release, rollout, rollback, and versioning controls for prompts, models, tools, and configurations.
  3. Design and support production evaluation practices to detect regressions, silent failures, quality drift, and performance issues.
  4. Contribute to data monitoring and reliability workflows, including detection of stopped processing, data gaps, freshness issues, schema drift, and anomalies.
  5. Support incident response, triage, root-cause analysis, and operational reporting for AI and data reliability issues.

Skills

Required

  • Observability
  • Monitoring
  • Tracing
  • Logging
  • Release controls
  • Rollback controls
  • Versioning
  • Incident response
  • Root-cause analysis
  • Data reliability
  • Schema drift detection
  • Anomaly detection
  • Production readiness
  • Maintainability
  • Latency monitoring
  • Throughput monitoring
  • Cost monitoring
  • Operational safeguards
  • Auditability
  • Controlled deployment practices

Nice to have

  • AI agent platform experience
  • AI-enabled analytics workflow experience
  • AI safety

What the JD emphasized

  • AI agent platform
  • AI-enabled analytics workflows
  • production AI systems
  • operational foundation
  • observability
  • release safety
  • operational tooling
  • AI and agent-based systems
  • trust in AI outputs
  • scale safely
  • AI agents
  • agent tools
  • prompts, models, tools, and configurations
  • production evaluation practices
  • data monitoring and reliability workflows
  • AI and data reliability issues
  • AI engineers
  • AI-enabled systems
  • enterprise AI platforms
