Principal Software Quality Engineer – GPU & Machine Learning

AMD · Semiconductors · San Jose, CA · Engineering

Principal Software Quality Engineer at AMD focusing on ROCm software validation for GPU and Machine Learning workloads. The role involves defining and owning the end-to-end validation architecture, setting release qualification gates, driving compute workload validation (including LLM training and inference), architecting test infrastructure, and leading complex debugging efforts. A key requirement is hands-on experience using agentic AI engineering environments in daily work.

What you'd actually do

  1. Own the end-to-end validation architecture for ROCm — unit, integration, framework, workload, performance, stress, stability, scale-out, and system-level test layers — across multiple GPU generations and server platforms.
  2. Define release-qualification gates and exit criteria for ROCm software releases (functional coverage, performance regressions, stability hours, scale targets, RAS criteria) and drive the org to meet them.
  3. Drive compute workload validation and characterization — LLM training and inference (PyTorch, vLLM, Triton, JAX), recommender systems, scientific HPC kernels, MLPerf-class benchmarks — establishing reproducible methodology, baselines, and regression tracking.
  4. Architect the test infrastructure — distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines.
  5. Lead complex escalation debug — partner with development, hardware, firmware, and customer-facing teams to root-cause the hardest multi-day, multi-node, multi-component failures and convert findings into durable test coverage.

Skills

Required

  • 12+ years of professional software engineering experience with a strong validation, SDET, or quality-engineering focus, including 5+ years in a senior IC role (Staff/Principal/PMTS or equivalent) leading validation of complex systems software.
  • BS/MS/PhD in Computer Science, Computer Engineering, or related discipline (or equivalent demonstrated experience).
  • Expert-level Python for test automation and infrastructure; strong C++ for debugging and extending production code paths under test.
  • Deep, demonstrable validation experience in at least two of the following domains: GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL); deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM); HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric); Linux kernel, GPU drivers, or accelerator firmware; distributed systems and large-scale cluster software; system-level validation for server-class compute nodes (multi-GPU, multi-node, fabric-attached environments), including stress/stability, soak, fault-injection, and RAS testing.
  • Proven, hands-on experience working efficiently in an agentic AI engineering environment — daily, production use of LLM-based coding agents (e.g., Cursor, Claude Code, Copilot Workspace, Codex-class agents) and orchestration frameworks for real engineering work, with demonstrable productivity, quality, or coverage gains attributable to those workflows. Comfort designing prompts, tool/MCP integrations, evaluation harnesses, and guardrails for autonomous and semi-autonomous agents.
  • Hands-on experience defining and shipping release qualification programs for software consumed by hyperscalers, OEMs, or other Tier-1 customers.
  • Mastery of GitHub at scale for quality engineering — PR gating, GitHub Actions, self-hosted runners, required status checks, release tagging, and open-source contribution and triage norms.
  • Strong command of modern, agile so

Nice to have

  • AMD Instinct™ GPU platforms
  • multi-GPU topologies, PCIe/Infinity Fabric/xGMI, BMC/IPMI, thermal/power, firmware interactions, and multi-node fabric (Ethernet/InfiniBand/UALink) bring-up and validation
  • LLM training and inference (PyTorch, vLLM, Triton, JAX)
  • recommender systems, scientific HPC kernels, MLPerf-class benchmarks
  • distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines
  • shift-left testing, test pyramids, contract testing between layers, hermetic test environments, deterministic reproducers, and continuous validation in trunk
  • PR gating policy, required checks, code-coverage standards, bug-bash and triage cadences, and disciplined issue management across ROCm/* repositories and partner upstream projects
  • partner with development, hardware, firmware, and customer-facing teams
  • product management, silicon, platform, and software architecture
  • next-generation Instinct GPUs and server platforms before tape-in milestones and silicon arrival
  • Mentor and elevate Senior and Staff validation engineers, SDETs, and SQA leads
  • strategic customer engagements, OEM qualification programs, and open-source community quality initiatives

What the JD emphasized

  • Expert-level Python for test automation and infrastructure
  • Deep, demonstrable validation experience in at least two of the following domains: GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL); deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM); HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric); Linux kernel, GPU drivers, or accelerator firmware; distributed systems and large-scale cluster software; system-level validation for server-class compute nodes
  • Proven, hands-on experience working efficiently in an agentic AI engineering environment
  • Hands-on experience defining and shipping release qualification programs for software consumed by hyperscalers, OEMs, or other Tier-1 customers.
  • Mastery of GitHub at scale for quality engineering

Other signals

  • Own the end-to-end validation architecture for ROCm
  • Define release-qualification gates and exit criteria for ROCm software releases
  • Drive compute workload validation and characterization — LLM training and inference
  • Architect the test infrastructure
  • Proven, hands-on experience working efficiently in an agentic AI engineering environment