Program Manager 4-proddev

Oracle · Enterprise · Bengaluru, Karnataka, India

This role is for a Principal Technical Program Manager focused on AI Infrastructure repair operations and fleet health within Oracle Cloud Infrastructure (OCI). The individual will lead cross-functional programs to improve AI infrastructure availability, repair efficiency, and operational excellence, working with engineering, SRE, operations, and supply chain teams. Key responsibilities include program ownership, KPI governance, executive reporting, incident management, and driving continuous improvement and automation in a rapidly scaling environment.

What you'd actually do

  1. Lead one or more major AI Infrastructure Repair domains, including strategic customer availability, partner repair execution, RMA/spares governance, repair workflow optimization, data and reporting frameworks, or engineering platform tooling.
  2. Translate availability and operational gaps into structured programs with clearly defined scope, milestones, KPIs, owners, dependencies, risks, and success criteria.
  3. Establish and drive operational mechanisms that ensure accountability, visibility, and execution excellence across repair programs.
  4. Own program reviews, executive reporting, and operational readiness assessments.
  5. Drive programs focused on improving AI Infrastructure fleet availability and repair performance.

Skills

Required

  • Technical Program Management
  • Program Operations
  • Release Management
  • Infrastructure Operations
  • cross-functional program leadership
  • stakeholder management
  • risk management
  • metrics development
  • dashboard creation
  • KPI definition
  • executive reporting
  • analytical skills
  • communication skills
  • organizational skills
  • cloud infrastructure
  • AI/ML infrastructure
  • GPU operations
  • fleet management
  • large-scale distributed systems

Nice to have

  • Master’s degree in Engineering, Computer Science, Business Administration, or related field
  • Experience supporting AI training or inference

What the JD emphasized

  • operational excellence is critical to customer success
  • high-visibility role
  • strong technical program management capabilities
  • operational rigor
  • ability to influence across organizational boundaries
  • rapidly scaling environment
  • customer availability
  • partner repair execution
  • RMA/spares governance
  • repair workflow optimization
  • data and reporting frameworks
  • engineering platform tooling
  • availability and operational gaps
  • structured programs
  • clearly defined scope
  • milestones
  • KPIs
  • owners
  • dependencies
  • risks
  • success criteria
  • operational mechanisms
  • accountability
  • visibility
  • execution excellence
  • program reviews
  • executive reporting
  • operational readiness assessments
  • fleet availability and repair performance
  • operational health indicators
  • systemic risks
  • cross-functional recovery efforts
  • high-priority fleet health issues
  • customer-impacting events
  • durable corrective actions
  • KPI governance
  • repair health
  • fleet operations
  • Fleet Availability
  • Unavailable Host Backlog
  • Repair Cycle Time
  • Repair Success Rate
  • Reopen Rate
  • Spare Availability
  • RMA Performance
  • Partner Responsiveness
  • SLA/SLO Compliance
  • reporting frameworks
  • dashboards
  • leadership visibility
  • repair execution
  • operational risk
  • customer impact
  • actionable insights
  • executive recommendations
  • cross-functional leadership
  • alignment
  • engineering
  • SRE
  • data center operations
  • supply chain
  • hardware vendors
  • partner organizations
  • globally distributed teams
  • critical repair and fleet health initiatives
  • influence stakeholders
  • drive decisions
  • highly matrixed environment
  • direct authority
  • align priorities
  • remove blockers
  • accelerate execution
  • incident, risk, and change management
  • escalation management
  • availability risk
  • governance mechanisms
  • issue prioritization
  • risk management
  • escalation handling
  • root cause analysis
  • corrective and preventive actions
  • operational risks
  • mitigation strategies
  • continuous improvement
  • automation
  • process optimization
  • repair efficiency
  • reduce manual effort
  • operational scalability
  • standardize repair playbooks
  • escalation paths
  • reporting processes
  • automation opportunities
  • operational productivity
  • tooling investments
  • fleet visibility
  • triage efficiency
  • 8+ years of experience
  • Technical Program Management
  • Program Operations
  • Release Management
  • Infrastructure Operations
  • related disciplines
  • 5+ years leading large-scale, cross-functional technical programs
  • conception through execution
  • managing complex programs
  • multiple stakeholders
  • analytical skills
  • developing metrics
  • drive execution
  • operations
  • business teams
  • verbal and written communication skills
  • executive-level communication
  • influence and drive accountability
  • matrixed organizations
  • organizational skills
  • manage multiple competing priorities