Sr Director IT

AMD · Semiconductors · San Jose, CA · General Management/Administration/Support

Lead the strategy, architecture, deployment, and operations of large-scale AI compute environments, including GPU clusters, private AI compute cloud platforms, GPU-as-a-Service capabilities, and hybrid deployments. The role requires hands-on technical leadership in AI/HPC infrastructure, high-speed networking, data center enablement, security, automation, and production operations.

What you'd actually do

  1. Define and execute the roadmap for AMD’s AI compute infrastructure, including on-prem GPU clusters, private AI cloud, neo-cloud, and public cloud environments.
  2. Lead GPU cluster bring-up, production operations, monitoring, automation, capacity planning, availability, and utilization improvement.
  3. Build and operate GPU-as-a-Service capabilities for internal users and strategic external needs.
  4. Architect and manage front-end and back-end networks for AI workloads, including high-speed egress, storage connectivity, RoCE/RDMA fabrics, congestion management, and performance tuning.
  5. Partner with storage, security, data center, platform, software, and cloud teams to deliver secure, scalable, and reliable AI compute services.

Skills

Required

  • 15+ years of experience in infrastructure engineering, cloud infrastructure, AI/HPC systems, networking, data center infrastructure, or related fields.
  • Proven experience building, scaling, and operating GPU clusters, AI compute platforms, HPC environments, or large-scale cloud infrastructure.
  • Strong understanding of GPU infrastructure, cluster operations, Linux, automation, orchestration, monitoring, and production support.
  • Deep experience with high-speed networking, including Ethernet fabrics, RoCE/RDMA, NICs, switches, optics, routing, segmentation, and performance troubleshooting.
  • Experience enabling storage and data movement for AI/HPC workloads.
  • Experience with private cloud, GPU-as-a-Service, hybrid cloud, or internal compute platform delivery.
  • Strong background in security, multi-tenancy, access control, and operational reliability.
  • Demonstrated ability to lead technical teams and communicate effectively with senior stakeholders.

Nice to have

  • Experience with AMD GPU platforms, ROCm, or accelerator software ecosystems.
  • Experience with direct liquid cooling, liquid-to-chip cooling, high-density AI racks, and data center readiness for GPU workloads.
  • Experience selecting or enabling data centers for AI/HPC infrastructure.
  • Experience integrating on-premises AI compute with neo-cloud and tier-1 cloud providers.
  • End-to-end understanding of AI infrastructure from data center and networking through workload execution and token delivery.

What the JD emphasized

  • AI compute infrastructure
  • GPU clusters
  • AI workloads
  • high-speed networking
  • production operations

Other signals

  • private AI compute cloud platforms
  • GPU-as-a-Service
  • large-scale AI compute environments