Principal Software Development Engineer - GPU/AI/ML

AMD · Semiconductors · Santa Clara, CA · Engineering

This Principal Software Development Engineer role at AMD focuses on GPU/AI/ML performance. It involves optimizing the AI software stack from GPU kernels to distributed systems, accelerating foundation models and agents, and co-designing hardware and software for AI workloads on AMD GPUs. Key responsibilities include improving training, post-training, and inference performance, and contributing to the ROCm ecosystem.

What you'd actually do

  1. Own the AI software stack: Establish best practices and drive performance from low-level GPU kernels to large-scale distributed systems. Use modern LLMs and agent-based tooling where it accelerates development and tuning of the ROCm ecosystem.
  2. Accelerate foundation models and agents: Improve training, post-training, and inference for LLMs and autonomous AI workloads so AMD is the default platform for the most demanding use cases.
  3. Co-design hardware and software: Partner on the full lifecycle—from GPU architecture input to software for new accelerators—and engage with the broader AI community to keep AMD at the forefront.

Skills

Required

  • Expert-level modern C++
  • Design of large, performance-critical systems
  • Strong grasp of GPU architecture, memory hierarchy, and kernel optimization (HIP/CUDA)
  • Hands-on delivery on large-scale C++/HIP/CUDA codebases
  • Comfort diagnosing bottlenecks with profilers in multi-GPU, distributed settings
  • Deep understanding of transformers, attention, and the full model lifecycle
  • Hands-on work in alignment and post-training—for example, SFT, RLHF, and GRPO
  • Awareness of current LLM trends, including MoE, quantization, speculative decoding, and agentic systems
  • Experience optimizing post-training and inference pipelines at scale

Nice to have

  • Substantial professional experience in software development within performance-critical environments
  • Extensive HIP/CUDA experience optimizing deep learning and OSS LLM inference/training kernels and operators
  • Strong technical ownership and a track record of shipping complex systems
  • Clear communication and influence across teams
  • Deep familiarity with the AMD ROCm/HIP ecosystem
  • Working knowledge of RTL design and Verilog/SystemVerilog for hardware–software co-design
  • Master's degree
  • PhD
  • Publications in AI/ML, GPU computing, or systems optimization

What the JD emphasized

  • performance-critical systems
  • kernel optimization
  • large-scale distributed systems
  • performance
  • large-scale C++/HIP/CUDA codebases
  • diagnosing bottlenecks
  • optimizing post-training and inference pipelines at scale
  • performance-critical environments
  • shipping complex systems

Other signals

  • improving how models train, align, and run on GPUs
  • tuning stacks, kernels, and workflows
  • improve training, post-training, and inference for LLMs and autonomous AI workloads