Sr. Software Development Engineer - Collectives and Network

AMD · Semiconductors · San Jose, CA · Engineering

Software Development Engineer focused on optimizing AI pre-training and distributed inference performance on AMD GPUs. This role involves deep dives into network, NIC, and GPU hardware architecture, software optimization, performance modeling, and collaboration across hardware, software, and AI framework teams.

What you'd actually do

  1. Performance tuning, profiling and analysis of large-scale models for LLM, diffusion, multimodal, RecSys and generative AI, both single-node and distributed, including exploring the tradeoffs and design decisions involved.
  2. Develop and improve frameworks, tools and infrastructure for performance estimation, modeling and reporting.
  3. Provide guidelines to customers on efficient network load-balancing, workload scheduling and model sharding strategies.
  4. Participate in hardware-software co-design for future hardware optimizations – especially on scale-up networks, NIC and scale-out networks.
  5. Help with strategy and roadmap for AMD Collectives and Network optimizations.

Skills

Required

  • Network, NIC and GPU hardware architecture
  • software optimization
  • performance modeling
  • AI frameworks
  • inference and training optimization
  • mapping model architecture to low-level software and hardware
  • PyTorch
  • JAX
  • vLLM
  • SGLang
  • performance analysis
  • network hardware architecture

Nice to have

  • technical leadership skills
  • working collaboratively with cross-functional teams
  • mentoring, coaching, and inspiring a diverse and talented team of researchers and engineers
  • excellent written, verbal, and presentation skills
  • coordinating internally and externally

What the JD emphasized

  • latest state-of-the-art AI models
  • distributed inference and deployment at scale is crucial

Other signals

  • performance optimization
  • distributed inference
  • AI pre-training
  • AMD GPU
  • ROCm