Sr. Staff Software Development Engineer - Collectives and Network Optimization

AMD · Semiconductors · San Jose, CA · Engineering

Senior engineer focused on optimizing AI pre-training and distributed inference performance on AMD GPUs. Responsibilities include strategy, architecture, optimization, tooling, and performance analysis across the software stack, with a focus on network and collectives optimization. Requires deep knowledge of hardware architecture, software optimization, AI frameworks, and distributed inference at scale.

What you'd actually do

  1. Help set strategy and roadmap for AMD Collectives and Network optimizations.
  2. Provide guidelines to customers on efficient network load-balancing, workload scheduling and model sharding strategies.
  3. Tune, profile, and analyze the performance of large-scale LLM, diffusion, multimodal, RecSys, and generative AI models, both single-node and distributed, and explore the associated tradeoffs and design decisions.
  4. Participate in hardware-software co-design for future hardware optimizations – especially on scale-up networks, NIC and scale-out networks.
  5. Develop and improve frameworks, tools, and infrastructure for performance estimation, modeling, and reporting.

Skills

Required

  • Network, NIC and GPU hardware architecture
  • software optimization
  • performance modeling
  • AI frameworks
  • inference and training optimization
  • mapping model architecture to low level software
  • distributed inference
  • PyTorch
  • JAX
  • vLLM
  • SGLang

Nice to have

  • technical leadership
  • cross-functional teams
  • mentor, coach, and inspire
  • Excellent written, verbal, and presentation skills

What the JD emphasized

  • distributed inference
  • latest state-of-the-art AI models
  • performance optimization
  • network hardware architecture
  • AI frameworks

Other signals

  • performance optimization
  • distributed inference
  • AI frameworks
  • large-scale models
  • AMD GPU