Senior AI Cluster Hardware Engineer

AMD · Semiconductors · Austin, TX · Engineering

This role focuses on optimizing the performance of GPU clusters, specifically the RDMA networks used in AI clusters, by analyzing data flows between the GPU, the NIC, and the cluster network. Responsibilities include scalability testing, performance profiling and tuning, documentation, and collaboration with hardware and software teams. The ideal candidate has a strong background in GPU architectures, parallel computing, and system-level performance tuning.

What you'd actually do

  1. Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies (RoCE and InfiniBand).
  2. Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.
  3. Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.
  4. Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.
  5. Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.

Skills

Required

  • GPU architectures
  • parallel computing
  • system-level performance tuning
  • debug methodologies
  • scripting languages (e.g., Python, Bash)
  • system-level performance analysis tools
  • analytical mindset
  • problem-solving skills
  • debug skills
  • RDMA network configuration, troubleshooting, and performance tuning
  • Linux kernel networking expertise

Nice to have

  • Machine learning
  • HPC system design
  • cluster management tools and systems
  • understanding of RDMA network drivers

What the JD emphasized

  • debugging complex HW/FW and drivers
  • RDMA network configuration, troubleshooting, and performance tuning
  • Linux kernel networking expertise