Principal Staff Software Developer – AI/ML Performance Validation & Systems Testing

AMD · Semiconductors · Markham, Canada · Engineering

Principal Staff Software Developer focused on AI/ML performance validation and systems testing for AMD's ROCm software stack. The role involves owning the end-to-end validation architecture, defining release-qualification gates, leading system-level testing, and driving compute workload validation for LLM training/inference and other AI/HPC workloads. It requires deep validation experience in GPU compute software and deep-learning frameworks, hands-on experience working in an agentic AI engineering environment, and a record of shipping release qualification programs for software consumed by hyperscalers and OEMs.

What you'd actually do

  1. Own the end-to-end validation architecture for ROCm — unit, integration, framework, workload, performance, stress, stability, scale-out, and system-level test layers — across multiple GPU generations and server platforms.
  2. Define release-qualification gates and exit criteria for ROCm software releases (functional coverage, performance regressions, stability hours, scale targets, RAS criteria) and drive the org to meet them.
  3. Lead system-level testing for server nodes — multi-GPU topologies, PCIe/Infinity Fabric/xGMI, BMC/IPMI, thermal/power, firmware interactions, and multi-node fabric (Ethernet/InfiniBand/UALink) bring-up and validation.
  4. Drive compute workload validation and characterization — LLM training and inference (PyTorch, vLLM, Triton, JAX), recommender systems, scientific HPC kernels, MLPerf-class benchmarks — establishing reproducible methodology, baselines, and regression tracking.
  5. Architect the test infrastructure — distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines.

Skills

Required

  • Expert-level Python for test automation and infrastructure
  • Strong C++ for debugging and extending production code paths under test
  • Deep, demonstrable validation experience in GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL)
  • Deep, demonstrable validation experience in deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM)
  • Proven, hands-on experience working efficiently in an agentic AI engineering environment
  • Hands-on experience defining and shipping release qualification programs for software consumed by hyperscalers, OEMs, or other Tier-1 customers
  • Mastery of GitHub at scale for quality engineering

Nice to have

  • HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric)
  • Linux kernel, GPU drivers, or accelerator firmware
  • Distributed systems and large-scale cluster software
  • System-level validation for server-class compute nodes — multi-GPU, multi-node, fabric-attached environments — including stress/stability, soak, fault-injection, and RAS testing

What the JD emphasized

  • Own the end-to-end validation architecture
  • Define release-qualification gates
  • Lead system-level testing
  • Drive compute workload validation and characterization
  • Architect the test infrastructure
  • Champion modern, agile quality engineering
  • Set the bar for GitHub-based quality workflows
  • Lead complex escalation debug
  • Influence the roadmap
  • Mentor and elevate Senior and Staff validation engineers
  • Represent ROCm validation externally
  • Proven, hands-on experience working efficiently in an agentic AI engineering environment
  • Hands-on experience defining and shipping release qualification programs for software consumed by hyperscalers, OEMs, or other Tier-1 customers
  • Mastery of GitHub at scale for quality engineering

Other signals

  • validation architecture for ROCm
  • release-qualification gates
  • compute workload validation and characterization
  • LLM training and inference
  • test infrastructure architecture
  • agentic AI engineering environment