Principal TPM - AI Infrastructure

The Principal TPM will lead cross-functional programs for Oracle's AI Infrastructure GPU Operations team, focusing on deployment planning, execution governance, operational readiness, and reliability for GPU infrastructure. The role involves managing operating mechanisms for regional deployment, fleet health, milestone tracking, executive reporting, and incident governance. A key aspect is improving scalability through dashboards, telemetry, documentation, and leveraging AI to enhance operations productivity. The role requires strong program discipline, business analytics, and the ability to translate ambiguous inputs into clear actions and metrics, supporting both AI training and inference workloads.

What you'd actually do

Drive availability and reliability of large-scale GPU fleets, identifying systemic issues and leading cross-functional recovery efforts.
Support operational readiness and performance of distributed AI training and inference workloads across multi-region GPU clusters.
Own end-to-end execution of critical AI Infrastructure GPU Operations programs, ensuring alignment with business priorities, customer needs, and operational risk signals.
Build, model, and maintain business planning inputs, financial forecasts, analytical views, and operating reports for AI Infrastructure GPU Operations programs.
Drive practical use of AI and automation to improve operations productivity, reduce manual toil, accelerate triage, improve ticket prioritization, and strengthen repeatability across GPU operations workflows.

Skills

Required

Technical program management
Program operations
Business operations
Data analysis
Infrastructure operations
Cross-functional initiative leadership
Business analytics
Program discipline
Communication with senior stakeholders
Ownership
Metrics-driven execution
Simplification
Scalability
Reliability
Operational mechanisms
Incident management
Change governance
Deployment governance
Risk management
Business planning
Financial forecasting
Executive reporting
Telemetry and observability
Automation
Documentation and playbook creation

Nice to have

Experience with AI infrastructure
Experience with GPU operations
Familiarity with NVIDIA and AMD GPU platforms
Familiarity with AI training and inference workloads
Experience with RoCE, InfiniBand, and data center networks
Practical use of AI to improve operations productivity

What the JD emphasized

GPU infrastructure
AI training and inference workloads
operational readiness
reliability
cross-functional programs
business analytics capability
customer impact
measurable reliability outcomes
technical and operational depth
disciplined execution
metrics
disciplined follow-through
scalability
clear operational mechanisms
senior stakeholders
consistent execution
ownership
strategic clarity
reliable OCI AI Infrastructure GPU Operations
continuous improvement
processes
telemetry
automation
cross-site coordination
stakeholder alignment
partner engagements
large-scale GPU fleets
systemic issues
cross-functional recovery efforts
distributed AI training and inference workloads
multi-region GPU clusters
current and next-generation hardware
NVIDIA H200, B200, GB200/GB300 platforms
AMD Instinct MI300X, MI325X, MI350X, MI355X
end-to-end execution
critical AI Infrastructure GPU Operations programs
business priorities
customer needs
operational risk signals
weekly operating cadences
governance forums
multiple concurrent initiatives
clear ownership
timelines
dependencies
decision points
committed actions
cross-functional delivery
engineering
platform
operations
business operations
finance
observability
SRE
network
senior leadership stakeholders
deployment governance
change review
readiness tracking
stakeholder handoff
operational execution processes
structured incident management mechanisms
root cause analysis
corrective and preventive actions
durable fixes
primary escalation point
engineering and operations teams
priority conflicts
accelerating issue resolution
Change Review Board processes
high-volume change activity
change-related incidents
protecting service quality
business planning inputs
financial forecasts
analytical views
operating reports
AI Infrastructure GPU Operations programs
executive-level reporting
monthly business reviews
weekly operational KPIs
critical project updates
risks
decisions
mitigation plans
data-driven insights
infrastructure performance
operational risk
measurable program outcomes
senior leadership
hardware vendors
cloud platform teams
cloud engineering
network teams
internal stakeholders
issue resolution
operational efficiency
complex technical, operational, and business situations
accurate narratives
recommendations
action plans
structured escalation
bug reporting mechanisms
time-to-resolution
critical issues
documentation
playbooks
onboarding materials
runbooks
repeatable processes
ambiguity
execution quality
practical use of AI and automation
operations productivity
manual toil
accelerate triage
ticket prioritization
repeatability
GPU operations workflows
observability and telemetry teams
infrastructure visibility
RDMA telemetry
network fabric health
service health metrics
operational dashboarding
continuous improvement efforts
validation frameworks
version set validation
link flap analysis
long-tail performance optimization
operational health
RoCE
InfiniBand
large-scale data center networks
technical program management
program operations
data analysis
infrastructure operations
complex, cross-functional initiatives
measurable outcomes
technical
business
customer

Full job description

The AI Infrastructure GPU Operations Team drives deployment planning, execution governance, operational readiness, reliability, and business rhythm for OCI's rapidly expanding GPU infrastructure portfolio. As Principal Technical Program Manager, you will lead cross-functional programs that connect engineering, platform, operations, business, finance, observability, SRE, network, and leadership teams across complex GPU operations initiatives.

You will own operating mechanisms for regional deployment readiness, GPU fleet health, milestone tracking, executive reporting, incident and change governance, risk management, and operational handoff across multiple concurrent GPU operations programs. This role requires strong program discipline, business analytics capability, and the ability to turn ambiguous technical and operational inputs into clear priorities, metrics, decisions, and action plans.

You will also improve the way the organization scales by strengthening dashboards, telemetry, documentation, onboarding, playbooks, repeatable processes, and the practical use of AI to improve operations productivity. The ideal candidate brings crisp communication, strong ownership, and pragmatic simplification to high-visibility GPU operations programs where disciplined execution, customer impact, and measurable reliability outcomes matter.

You are a structured, data-driven program leader who values simplicity, scalability, reliability, and clear operational mechanisms. You thrive in collaborative environments, communicate crisply with senior stakeholders, and drive consistent execution through ownership, metrics, and disciplined follow-through. You combine strategic clarity with enough technical and operational depth to help teams deliver reliable OCI AI Infrastructure GPU Operations while continuously improving the processes, telemetry, and automation that support it.

Travel: as needed for cross-site coordination, stakeholder alignment, and partner engagements.

Key Responsibilities

GPU Fleet Operations & Reliability

Drive availability and reliability of large-scale GPU fleets, identifying systemic issues and leading cross-functional recovery efforts.
Support operational readiness and performance of distributed AI training and inference workloads across multi-region GPU clusters.
Lead GPU fleet health reviews across current and next-generation hardware, including NVIDIA H200, B200, and GB200/GB300 platforms; AMD Instinct MI300X, MI325X, MI350X, and MI355X; and related platforms.

Program Leadership & Execution

Own end-to-end execution of critical AI Infrastructure GPU Operations programs, ensuring alignment with business priorities, customer needs, and operational risk signals.
Set and run weekly operating cadences and governance forums across multiple concurrent initiatives, ensuring clear ownership, timelines, dependencies, decision points, and committed actions.
Coordinate cross-functional delivery across engineering, platform, operations, business operations, finance, observability, SRE, network, and senior leadership stakeholders.

Incident, Change & Deployment Governance

Manage deployment governance, change review, readiness tracking, stakeholder handoff, and operational execution processes.
Establish and scale structured incident management mechanisms, improving root cause analysis, corrective and preventive actions, and follow-through on durable fixes.
Serve as a primary escalation point between engineering and operations teams, resolving priority conflicts and accelerating issue resolution.
Lead Change Review Board processes for high-volume change activity, minimizing change-related incidents and protecting service quality.

Business Planning, Metrics & Executive Reporting

Build, model, and maintain business planning inputs, financial forecasts, analytical views, and operating reports for AI Infrastructure GPU Operations programs.
Own executive-level reporting, including monthly business reviews, weekly operational KPIs, critical project updates, risks, dependencies, decisions, and mitigation plans.
Provide data-driven insights into infrastructure performance, operational risk, customer impact, and measurable program outcomes for senior leadership.

Cross-Functional & Stakeholder Engagement

Strengthen partnerships with hardware vendors, cloud platform teams, SRE, cloud engineering, network teams, and other internal stakeholders to improve issue resolution and operational efficiency.
Translate complex technical, operational, and business situations into accurate narratives, recommendations, and action plans for senior stakeholders.
Drive structured escalation and bug reporting mechanisms that reduce time-to-resolution for critical issues.

Operational Excellence, Optimization & AI Productivity

Create and maintain documentation, playbooks, onboarding materials, runbooks, and repeatable processes that reduce ambiguity and improve execution quality.
Drive practical use of AI and automation to improve operations productivity, reduce manual toil, accelerate triage, improve ticket prioritization, and strengthen repeatability across GPU operations workflows.
Partner with observability and telemetry teams to improve infrastructure visibility, including RDMA telemetry, network fabric health, service health metrics, and operational dashboarding.
Lead continuous improvement efforts such as validation frameworks, version set validation, link flap analysis, and long-tail performance optimization.
Monitor and improve operational health across technologies such as RoCE, InfiniBand, and large-scale data center networks.

Qualifications / Experience

5+ years of experience in technical program management, program operations, business operations, data analysis, infrastructure operations, or a related discipline.
Demonstrated ability to lead complex, cross-functional initiatives with measurable outcomes across technical, operations, business, and customer-facing stakeholders.
Strong operational background with experience building cadences, governance mechanisms, KPI reporting, incident/change processes, risk management processes, or readiness programs.
Strong written and verbal communication skills; comfortable synthesizing complex technical and operational information into executive updates, recommendations, and decisions.
A high degree of organization and ability to manage multiple competing priorities independently through ambiguity.
Experience identifying, measuring, and adjusting execution plans against key business, operational, reliability, or delivery metrics.
Advanced Excel skills, including pivots, lookups, conditional logic, data modeling, and financial or operational analysis.
Experience developing dashboards, automated reporting, or analytical tools that provide reliable business and operational visibility.
Working knowledge of PowerPoint, Jira, Confluence, and related collaboration or delivery management tools.

Preferred / Nice to Have

Experience with cloud infrastructure, AI/ML infrastructure, GPU operations, data center deployment, capacity planning, or large-scale platform operations.
Experience supporting large GPU fleets, distributed AI training or inference workloads, or performance-sensitive infrastructure environments.
Experience with incident management, root cause analysis, corrective and preventive action tracking, Change Review Board processes, or high-volume change governance.
Familiarity with observability, telemetry, RDMA, RoCE, InfiniBand, network fabric health, service health metrics, ticket/incident analytics, or operational dashboarding.
Finance, business planning, workforce planning, or operational readiness experience in a technology organization.
Track record of influencing senior business and technology leaders without relying on direct authority.

Disclaimer:

Certain U.S.-based or U.S. customer- or client-facing roles may be required to comply with applicable requirements, such as immunization/occupational health mandates and/or drug testing requirements.

Range and benefit information provided in this posting are specific to the stated locations only.

US: Hiring Range in USD from: $102,300 to $209,500 per annum. May be eligible for bonus and equity.

Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business. Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.

Oracle US offers a comprehensive benefits package which includes the following:

Medical, dental, and vision insurance, including expert medical opinion
Short term disability and long term disability
Life insurance and AD&D
Supplemental life insurance (Employee/Spouse/Child)
Health care and dependent care Flexible Spending Accounts
Pre-tax commuter and parking benefits
401(k) Savings and Investment Plan with company match
Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
11 paid holidays
Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
Paid parental leave
Adoption assistance
Employee Stock Purchase Plan
Financial planning and group legal
Voluntary benefits including auto, homeowner and pet insurance

The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.

Career Level - IC4