WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. **Together, we advance your career.**

This role is not eligible for visa sponsorship.

THE ROLE

AMD is seeking a Senior Director, AI Compute Infrastructure & GPU Cloud Services to lead the strategy, architecture, deployment, and operations of large-scale AI compute environments. This role will enable GPU clusters, private AI compute cloud platforms, GPU-as-a-Service capabilities, and hybrid deployments across on-premises infrastructure, neo-clouds, and tier-1 public clouds.

The position requires a hands-on technical leader with strong experience in AI/HPC infrastructure, high-speed networking, data center enablement, security, automation, and production operations. This leader will lead the team and help scale the platform, operating model, and roadmap to support AMD’s growing internal and external AI compute needs.

THE PERSON

The ideal candidate is a senior leader with deep technical expertise and strong execution skills. They have experience building and operating GPU clusters, enabling AI workloads, managing complex networks, improving utilization, and delivering reliable compute services. They can operate strategically while remaining hands-on when needed, and they bring a strong vision for how AI compute platforms should evolve.

KEY RESPONSIBILITIES

Define and execute the roadmap for AMD’s AI compute infrastructure, including on-prem GPU clusters, private AI cloud, neo-cloud, and public cloud environments.
Lead GPU cluster bring-up, production operations, monitoring, automation, capacity planning, availability, and utilization improvement.
Build and operate GPU-as-a-Service capabilities for internal users and strategic external needs.
Architect and manage front-end and back-end networks for AI workloads, including high-speed egress, storage connectivity, RoCE/RDMA fabrics, congestion management, and performance tuning.
Partner with storage, security, data center, platform, software, and cloud teams to deliver secure, scalable, and reliable AI compute services.
Drive end-to-end infrastructure readiness from data center planning through workload execution and token delivery.
Support data center evaluation and enablement for high-performance AI compute, including power, cooling, rack density, and direct liquid cooling requirements.
Lead, mentor, and grow a small technical team while working cross-functionally with engineering, IT, vendors, and executive stakeholders.

REQUIRED EXPERIENCE

15+ years of experience in infrastructure engineering, cloud infrastructure, AI/HPC systems, networking, data center infrastructure, or related fields.
Proven experience building, scaling, and operating GPU clusters, AI compute platforms, HPC environments, or large-scale cloud infrastructure.
Strong understanding of GPU infrastructure, cluster operations, Linux, automation, orchestration, monitoring, and production support.
Deep experience with high-speed networking, including Ethernet fabrics, RoCE/RDMA, NICs, switches, optics, routing, segmentation, and performance troubleshooting.
Experience enabling storage and data movement for AI/HPC workloads.
Experience with private cloud, GPU-as-a-Service, hybrid cloud, or internal compute platform delivery.
Strong background in security, multi-tenancy, access control, and operational reliability.
Demonstrated ability to lead technical teams and communicate effectively with senior stakeholders.

PREFERRED EXPERIENCE

Experience with AMD GPU platforms, ROCm, or accelerator software ecosystems.
Experience with direct liquid cooling, liquid-to-chip cooling, high-density AI racks, and data center readiness for GPU workloads.
Experience selecting or enabling data centers for AI/HPC infrastructure.
Experience integrating on-premises AI compute with neo-cloud and tier-1 cloud providers.
End-to-end understanding of AI infrastructure from data center and networking through workload execution and token delivery.

ACADEMIC CREDENTIALS

Bachelor’s degree in Computer Science, Electrical Engineering, Computer Engineering, or a related technical field, or equivalent experience. Advanced degree preferred.

LOCATION

San Jose, California

_Benefits offered are described:_ AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD’s “Responsible AI Policy” is available here.

This posting is for an existing vacancy.