Principal TPM - AI Infrastructure

Oracle · Enterprise · Seattle, WA +1

This role focuses on managing and improving the operational aspects of AI infrastructure, specifically GPU operations for Oracle Cloud Infrastructure (OCI). The Principal TPM will lead cross-functional programs related to deployment planning, execution governance, operational readiness, reliability, and business rhythm for OCI's GPU infrastructure. Responsibilities include owning operating mechanisms for regional deployment readiness, GPU fleet health, milestone tracking, executive reporting, incident and change governance, risk management, and operational handoff. The role also involves improving scalability through dashboards, telemetry, documentation, and leveraging AI to enhance operations productivity. The ideal candidate will have strong program discipline, business analytics skills, and the ability to drive execution in a collaborative environment.

What you'd actually do

  1. Drive availability and reliability of large-scale GPU fleets, identifying systemic issues and leading cross-functional recovery efforts.
  2. Support operational readiness and performance of distributed AI training and inference workloads across multi-region GPU clusters.
  3. Own end-to-end execution of critical AI Infrastructure GPU Operations programs, ensuring alignment with business priorities, customer needs, and operational risk signals.
  4. Set and run weekly operating cadences and governance forums across multiple concurrent initiatives, ensuring clear ownership, timelines, dependencies, decision points, and committed actions.
  5. Build, model, and maintain business planning inputs, financial forecasts, analytical views, and operating reports for AI Infrastructure GPU Operations programs.

Skills

Required

  • 5+ years of experience in technical program management, program operations, business operations, data analysis, infrastructure operations, or a related discipline.
  • Demonstrated ability to lead complex, cross-functional initiatives with measurable outcomes across technical, operations, business, and customer-facing teams.
  • Strong program discipline
  • Business analytics capability
  • Pragmatic simplification
  • Structured, data-driven program leadership
  • Scalability
  • Reliability
  • Clear operational mechanisms
  • Crisp communication with senior stakeholders
  • Ownership
  • Metrics
  • Disciplined follow-through
  • Strategic clarity
  • Technical and operational depth
  • Continuous improvement
  • Incident management
  • Change governance
  • Executive reporting
  • Stakeholder engagement

Nice to have

  • Experience with NVIDIA H200, B200, GB200/GB300 platforms
  • Experience with AMD Instinct MI300X, MI325X, MI350X, MI355X
  • Experience with RoCE, InfiniBand, and large-scale data center networks
  • Experience using AI to improve operations productivity

What the JD emphasized

  • GPU infrastructure
  • AI training and inference workloads
  • operational readiness
  • reliability
  • cross-functional programs
  • business analytics
  • AI to improve operations productivity

Other signals

  • deployment planning