AI Platform Reliability Engineer

Oracle · Enterprise · United States

Seeking an AI Platform Reliability Engineer to ensure the reliability, observability, and safety of AI agent platforms and AI-enabled analytics workflows in production. This role focuses on the operational foundation of AI systems, including monitoring, tracing, production evaluation, and quality safeguards, while also supporting data reliability use cases.

What you'd actually do

  1. Build and maintain observability, logging, tracing, and monitoring for AI agents, agent tools, and AI-enabled analytics workflows.
  2. Implement release, rollout, rollback, and versioning controls for prompts, models, tools, and configurations.
  3. Design and support production evaluation practices to detect regressions, silent failures, quality drift, and performance issues.
  4. Contribute to data monitoring and reliability workflows, including detection of stopped processing, data gaps, freshness issues, schema drift, and anomalies.
  5. Support incident response, triage, root-cause analysis, and operational reporting for AI and data reliability issues.

Skills

Required

  • Observability
  • Monitoring
  • Tracing
  • Logging
  • Release controls
  • Rollback controls
  • Versioning
  • Incident response
  • Root-cause analysis
  • Data reliability
  • Schema drift detection
  • Anomaly detection
  • AI agents
  • Agent tools
  • AI-enabled analytics workflows

Nice to have

  • Prompt engineering
  • Model versioning
  • Configuration management
  • Performance monitoring
  • Cost monitoring

What the JD emphasized

  • AI agent platform
  • AI-enabled analytics workflows
  • production AI systems
  • observability
  • release safety
  • operational tooling
  • trust in AI outputs
  • scale safely
  • production-ready
  • measurable
  • maintainable
  • operational safeguards
  • auditability
  • controlled deployment practices
