Software Development Engineer - Collectives and Network

AMD · Semiconductors · San Jose, CA · Engineering

Software Development Engineer focused on optimizing AI pre-training and distributed inference performance on AMD GPUs. This role involves deep dives into network, NIC, and GPU hardware architecture, software optimization, performance modeling, and working with AI frameworks to achieve industry-leading performance for various AI models.

What you'd actually do

  1. Assist with strategy and roadmap for AMD Collectives and Network optimizations.
  2. Provide guidelines to customers on efficient network load-balancing, workload scheduling and model sharding strategies.
  3. Tune, profile and analyze the performance of large-scale models (LLM, diffusion, multimodal, RecSys and generative AI) in both single-node and distributed settings, and explore the associated tradeoffs and design decisions.
  4. Participate in hardware-software co-design for future hardware optimizations – especially on scale-up networks, NIC and scale-out networks.
  5. Develop and improve frameworks, tools and infrastructure for performance estimation, modeling and reporting.

Skills

Required

  • Network, NIC and GPU hardware architecture
  • software optimization
  • performance modeling
  • AI frameworks
  • inference and training optimization
  • mapping model architecture to low-level software and hardware
  • understanding the impact of each layer of the stack on model performance
  • latest generative model architectures
  • distributed inference
  • deployment at scale
  • performance tuning
  • profiling
  • analysis of large-scale models
  • LLM
  • diffusion
  • multimodal
  • RecSys
  • generative AI
  • single node and distributed
  • hardware-software co-design
  • scale-up networks
  • NIC
  • scale-out networks
  • framework, tools and infrastructure for performance estimation, modeling and reporting
  • PyTorch
  • JAX
  • vLLM
  • SGLang

Nice to have

  • technical leadership skills
  • work collaboratively with cross-functional teams
  • mentor, coach, and inspire a diverse and talented team of researchers and engineers
  • excellent written, verbal, and presentation skills
  • ability to coordinate internally and externally

What the JD emphasized

  • latest state-of-the-art AI models
  • distributed inference and deployment at scale are crucial

Other signals

  • performance optimization
  • distributed inference
  • AI frameworks
  • GPU hardware