Systems Design Engineer - AI Cluster Software

AMD · Semiconductors · Austin, TX · Engineering

Systems Design Engineer focused on AI cluster software, creating reference architectures, configuration guides, and reproducible experiments for AMD-based AI infrastructure. The role involves deep technical evaluations of AI stacks across compute, storage, networking, and observability, and developing tools to validate performance hypotheses.

What you'd actually do

  1. Apply your expertise to shape AI infrastructure by creating reference architectures, configuration guides, and deployment blueprints that help internal teams and customers make informed hardware and software decisions.
  2. Perform deep technical evaluations of AI stacks across compute, storage, networking, and observability layers, documenting how they work, where they fit, and the tradeoffs involved.
  3. Design and execute reproducible experiments and benchmarking harnesses to compare technologies such as schedulers, distributed training libraries, and observability stacks.
  4. Develop small reference implementations and tools to validate performance hypotheses and analyze system behavior.
  5. Build a library of technical artifacts, including presentations, design documents, and “how it works” guides, to support pre-sales engineers and help others build skills from an HPC perspective.

Skills

Required

  • Linux operating systems
  • networking
  • filesystems
  • containers
  • performance tooling
  • distributed training
  • orchestration systems
  • performance analysis
  • technical documentation

Nice to have

  • ROCm
  • RCCL
  • Instinct GPUs
  • EPYC platforms
  • MPI/OpenMP
  • parallel filesystems
  • object stores
  • RDMA
  • Terraform
  • Ansible

What the JD emphasized

  • AI workloads like inferencing and training
  • building the blueprint
  • deep technical evaluations
  • reproducible experiments
  • performance hypotheses
  • distributed training internals
  • performance analysis

Other signals

  • AI infrastructure
  • reference architectures
  • performance benchmarking
  • distributed training
  • inference