Technical Program Manager - AI Cluster Engineering

AMD · Semiconductors · Austin, TX · Engineering

This role is for a Technical Program Manager to drive end-to-end execution of AI cluster engineering programs, focusing on GPU platforms, rack-scale solutions, high-speed networking, and datacenter AI infrastructure. The TPM will manage program plans, risks, and dependencies for server integration, rack bring-up, and cluster-scale deployment readiness, working cross-functionally with engineering, operations, vendors, and customers.

What you'd actually do

  1. Define and drive program plans for AI infrastructure systems validation and readiness, including server integration, rack bring-up, and cluster-scale deployment.
  2. Own program execution for rack- and cluster-network enablement, including topology decisions, switching/optics/cabling readiness, and validation schedules for scale-out operation.
  3. Lead cross-functional delivery for rack solutions that integrate CPU + GPU + NICs, ensuring end-to-end readiness across hardware, firmware, and management interfaces.
  4. Own program coordination for pod/rack manageability solutions, aligning requirements and milestones for inventory, health monitoring, cluster provisioning, and observability across large-scale deployments.
  5. Drive readiness for rack-level automation and regression workflows (scripts, log mapping, infrastructure automation planning), and plan execution to reduce schedule risk from hardware arrival timing.

Skills

Required

  • Proven program management experience delivering complex, cross-functional hardware/software infrastructure programs (server/rack/cluster environments).
  • Strong understanding of datacenter building blocks and lifecycle: servers, racks, clusters, HW/FW/SW integration, and readiness/validation flows.
  • Demonstrated ability to build and run schedules, manage risks, lead matrix teams, and communicate clearly to engineering and executive audiences.
  • Strong working knowledge of program tools (e.g., Jira/Confluence/Microsoft Office) and dashboard-based execution management.

Nice to have

  • AI cluster networking domain experience (NICs, switching/optics/cabling, topologies).
  • Familiarity with rack/pod management and operations concerns (telemetry, health monitoring, power control, FW provisioning, management networks).
  • Experience leading server integration programs at OEMs, ODMs, or data centers.
  • Demonstrated horizontal leadership across large matrix organizations.
  • Formal PM education/certification (PMP / Scrum Master) preferred.

What the JD emphasized

  • AI cluster engineering programs
  • GPU platforms, rack-scale solutions, high-speed networking, and datacenter AI infrastructure
  • Server integration to rack- and cluster-level validation
  • AI networking requirements
  • GPU / Rack Solution Integration
  • AI Infrastructure & Manageability
  • Automation, Tooling, and Regression Readiness