Principal TPM - AI Infrastructure

Oracle · Enterprise · Austin, TX +1

The Principal TPM will lead cross-functional programs for Oracle's AI Infrastructure GPU Operations team, focusing on deployment planning, execution governance, operational readiness, and reliability for GPU infrastructure. The role involves managing operating mechanisms for regional deployment, fleet health, milestone tracking, executive reporting, and incident governance. A key part of the role is improving scalability through dashboards, telemetry, documentation, and the practical use of AI to raise operations productivity. It requires strong program discipline, business analytics, and the ability to translate ambiguous inputs into clear actions and metrics, in support of both AI training and inference workloads.

What you'd actually do

  1. Drive availability and reliability of large-scale GPU fleets, identifying systemic issues and leading cross-functional recovery efforts.
  2. Support operational readiness and performance of distributed AI training and inference workloads across multi-region GPU clusters.
  3. Own end-to-end execution of critical AI Infrastructure GPU Operations programs, ensuring alignment with business priorities, customer needs, and operational risk signals.
  4. Build, model, and maintain business planning inputs, financial forecasts, analytical views, and operating reports for AI Infrastructure GPU Operations programs.
  5. Drive practical use of AI and automation to improve operations productivity, reduce manual toil, accelerate triage, improve ticket prioritization, and strengthen repeatability across GPU operations workflows.

Skills

Required

  • Technical program management
  • Program operations
  • Business operations
  • Data analysis
  • Infrastructure operations
  • Cross-functional initiative leadership
  • Business analytics
  • Program discipline
  • Communication with senior stakeholders
  • Ownership
  • Metrics-driven execution
  • Simplification
  • Scalability
  • Reliability
  • Operational mechanisms
  • Incident management
  • Change governance
  • Deployment governance
  • Risk management
  • Business planning
  • Financial forecasting
  • Executive reporting
  • Telemetry and observability
  • Automation
  • Documentation and playbook creation

Nice to have

  • Experience with AI infrastructure
  • Experience with GPU operations
  • Familiarity with NVIDIA and AMD GPU platforms
  • Familiarity with AI training and inference workloads
  • Experience with RoCE, InfiniBand, and data center networks
  • Practical use of AI to improve operations productivity

What the JD emphasized

  • GPU infrastructure
  • AI training and inference workloads
  • operational readiness
  • reliability
  • cross-functional programs
  • business analytics capability
  • customer impact
  • measurable reliability outcomes
  • technical and operational depth
  • disciplined execution
  • metrics
  • disciplined follow-through
  • scalability
  • clear operational mechanisms
  • senior stakeholders
  • consistent execution
  • ownership
  • strategic clarity
  • reliable OCI AI Infrastructure GPU Operations
  • continuous improvement
  • processes
  • telemetry
  • automation
  • cross-site coordination
  • stakeholder alignment
  • partner engagements
  • large-scale GPU fleets
  • systemic issues
  • cross-functional recovery efforts
  • distributed AI training and inference workloads
  • multi-region GPU clusters
  • current and next-generation hardware
  • NVIDIA H200, B200, GB200/GB300 platforms
  • AMD Instinct MI300X, MI325X, MI350X, MI355X
  • end-to-end execution
  • critical AI Infrastructure GPU Operations programs
  • business priorities
  • customer needs
  • operational risk signals
  • weekly operating cadences
  • governance forums
  • multiple concurrent initiatives
  • clear ownership
  • timelines
  • dependencies
  • decision points
  • committed actions
  • cross-functional delivery
  • engineering
  • platform
  • operations
  • business operations
  • finance
  • observability
  • SRE
  • network
  • senior leadership stakeholders
  • deployment governance
  • change review
  • readiness tracking
  • stakeholder handoff
  • operational execution processes
  • structured incident management mechanisms
  • root cause analysis
  • corrective and preventive actions
  • durable fixes
  • primary escalation point
  • engineering and operations teams
  • priority conflicts
  • accelerating issue resolution
  • Change Review Board processes
  • high-volume change activity
  • change-related incidents
  • protecting service quality
  • business planning inputs
  • financial forecasts
  • analytical views
  • operating reports
  • AI Infrastructure GPU Operations programs
  • executive-level reporting
  • monthly business reviews
  • weekly operational KPIs
  • critical project updates
  • risks
  • dependencies
  • decisions
  • mitigation plans
  • data-driven insights
  • infrastructure performance
  • operational risk
  • customer impact
  • measurable program outcomes
  • senior leadership
  • hardware vendors
  • cloud platform teams
  • SRE
  • cloud engineering
  • network teams
  • internal stakeholders
  • issue resolution
  • operational efficiency
  • complex technical, operational, and business situations
  • accurate narratives
  • recommendations
  • action plans
  • senior stakeholders
  • structured escalation
  • bug reporting mechanisms
  • time-to-resolution
  • critical issues
  • documentation
  • playbooks
  • onboarding materials
  • runbooks
  • repeatable processes
  • ambiguity
  • execution quality
  • practical use of AI and automation
  • operations productivity
  • manual toil
  • accelerate triage
  • ticket prioritization
  • repeatability
  • GPU operations workflows
  • observability and telemetry teams
  • infrastructure visibility
  • RDMA telemetry
  • network fabric health
  • service health metrics
  • operational dashboarding
  • continuous improvement efforts
  • validation frameworks
  • version set validation
  • link flap analysis
  • long-tail performance optimization
  • operational health
  • RoCE
  • InfiniBand
  • large-scale data center networks
  • technical program management
  • program operations
  • business operations
  • data analysis
  • infrastructure operations
  • complex, cross-functional initiatives
  • measurable outcomes
  • technical
  • operations
  • business
  • customer

Other signals

  • GPU infrastructure
  • AI training and inference workloads
  • operational readiness
  • reliability
  • cross-functional programs