Principal Software Quality Engineer – GPU & Machine Learning

AMD · Semiconductors · San Jose, CA · Engineering

This Principal Software Quality Engineer role at AMD focuses on ROCm software validation for GPU and machine learning workloads. It involves defining and owning the end-to-end validation architecture, setting release qualification gates, driving compute workload validation (including LLM training and inference), architecting test infrastructure, and leading complex debugging efforts. A key requirement is hands-on, daily production use of LLM-based coding agents in an agentic AI engineering environment.

What you'd actually do

Own the end-to-end validation architecture for ROCm — unit, integration, framework, workload, performance, stress, stability, scale-out, and system-level test layers — across multiple GPU generations and server platforms.
Define release-qualification gates and exit criteria for ROCm software releases (functional coverage, performance regressions, stability hours, scale targets, RAS criteria) and drive the org to meet them.
Drive compute workload validation and characterization — LLM training and inference (PyTorch, vLLM, Triton, JAX), recommender systems, scientific HPC kernels, MLPerf-class benchmarks — establishing reproducible methodology, baselines, and regression tracking.
Architect the test infrastructure — distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines.
Lead complex escalation debug — partner with development, hardware, firmware, and customer-facing teams to root-cause the hardest multi-day, multi-node, multi-component failures and convert findings into durable test coverage.

Skills

Required

12+ years of professional software engineering experience with a strong validation, SDET, or quality-engineering focus, including 5+ years in a senior IC role (Staff/Principal/PMTS or equivalent) leading validation of complex systems software.
BS/MS/PhD in Computer Science, Computer Engineering, or related discipline (or equivalent demonstrated experience).
Expert-level Python for test automation and infrastructure; strong C++ for debugging and extending production code paths under test.
Deep, demonstrable validation experience in at least two of the following domains: GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL); deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM); HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric); Linux kernel, GPU drivers, or accelerator firmware; distributed systems and large-scale cluster software; system-level validation for server-class compute nodes (multi-GPU, multi-node, fabric-attached environments), including stress/stability, soak, fault-injection, and RAS testing.
Proven, hands-on experience working efficiently in an agentic AI engineering environment — daily, production use of LLM-based coding agents (e.g., Cursor, Claude Code, Copilot Workspace, Codex-class agents) and orchestration frameworks for real engineering work, with demonstrable productivity, quality, or coverage gains attributable to those workflows. Comfort designing prompts, tool/MCP integrations, evaluation harnesses, and guardrails for autonomous and semi-autonomous agents.
Hands-on experience defining and shipping release qualification programs for software consumed by hyperscalers, OEMs, or other Tier-1 customers.
Mastery of GitHub at scale for quality engineering — PR gating, GitHub Actions, self-hosted runners, required status checks, release tagging, and open-source contribution and triage norms.
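Strong command of modern, agile software development practices (trunk-based development, CI/CD, shift-left testing, observability, feature flags, and incremental delivery), applied specifically to validation organizations.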

Nice to have

AMD Instinct™ GPU platforms
multi-GPU topologies, PCIe/Infinity Fabric/xGMI, BMC/IPMI, thermal/power, firmware interactions, and multi-node fabric (Ethernet/InfiniBand/UALink) bring-up and validation
LLM training and inference (PyTorch, vLLM, Triton, JAX)
recommender systems, scientific HPC kernels, MLPerf-class benchmarks
distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines
shift-left testing, test pyramids, contract testing between layers, hermetic test environments, deterministic reproducers, and continuous validation in trunk
PR gating policy, required checks, code-coverage standards, bug-bash and triage cadences, and disciplined issue management across ROCm/* repositories and partner upstream projects
partner with development, hardware, firmware, and customer-facing teams
product management, silicon, platform, and software architecture
next-generation Instinct GPUs and server platforms before tape-in milestones and silicon arrival
Mentor and elevate Senior and Staff validation engineers, SDETs, and SQA leads
strategic customer engagements, OEM qualification programs, and open-source community quality initiatives

What the JD emphasized

Expert-level Python for test automation and infrastructure
Deep, demonstrable validation experience in at least two of the following domains: GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL); deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM); HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric); Linux kernel, GPU drivers, or accelerator firmware; distributed systems and large-scale cluster software; system-level validation for server-class compute nodes
Proven, hands-on experience working efficiently in an agentic AI engineering environment
Hands-on experience defining and shipping release qualification programs for software consumed by hyperscalers, OEMs, or other Tier-1 customers.
Mastery of GitHub at scale for quality engineering

Other signals

Own the end-to-end validation architecture for ROCm
Define release-qualification gates and exit criteria for ROCm software releases
Drive compute workload validation and characterization — LLM training and inference
Architect the test infrastructure
Proven, hands-on experience working efficiently in an agentic AI engineering environment

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

About the Role

We are seeking a Principal Member of Technical Staff (PMTS) to serve as the senior technical leader for ROCm software validation across compute workloads and server-class systems. In this individual-contributor leadership role, you will define how AMD proves ROCm is ready to ship — from unit and component testing, through full-stack workload validation, to multi-node system-level qualification on AMD Instinct™ GPU platforms. You will set the technical direction for validation strategy, build and evolve the test infrastructure that gates every ROCm release, and personally drive the hardest debugging, characterization, and qualification problems. Your work directly determines the quality bar experienced by hyperscalers, OEMs, sovereign-AI customers, and the open-source community running ROCm in production.

What You Will Do

Own the end-to-end validation architecture for ROCm — unit, integration, framework, workload, performance, stress, stability, scale-out, and system-level test layers — across multiple GPU generations and server platforms.
Define release-qualification gates and exit criteria for ROCm software releases (functional coverage, performance regressions, stability hours, scale targets, RAS criteria) and drive the org to meet them.
Lead system-level testing for server nodes — multi-GPU topologies, PCIe/Infinity Fabric/xGMI, BMC/IPMI, thermal/power, firmware interactions, and multi-node fabric (Ethernet/InfiniBand/UALink) bring-up and validation.
Drive compute workload validation and characterization — LLM training and inference (PyTorch, vLLM, Triton, JAX), recommender systems, scientific HPC kernels, MLPerf-class benchmarks — establishing reproducible methodology, baselines, and regression tracking.
Architect the test infrastructure — distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines.
Champion modern, agile quality engineering — shift-left testing, test pyramids, contract testing between layers, hermetic test environments, deterministic reproducers, and continuous validation in trunk.
Set the bar for GitHub-based quality workflows — PR gating policy, required checks, code-coverage standards, bug-bash and triage cadences, and disciplined issue management across ROCm/* repositories and partner upstream projects.
Lead complex escalation debug — partner with development, hardware, firmware, and customer-facing teams to root-cause the hardest multi-day, multi-node, multi-component failures and convert findings into durable test coverage.
Influence the roadmap — work with product management, silicon, platform, and software architecture to ensure validation readiness for next-generation Instinct GPUs and server platforms before tape-in milestones and silicon arrival.
Mentor and elevate Senior and Staff validation engineers, SDETs, and SQA leads; raise the technical bar through design review, code review, and written guidance.
Represent ROCm validation externally — strategic customer engagements, OEM qualification programs, and open-source community quality initiatives.

Minimum Qualifications

12+ years of professional software engineering experience with a strong validation, SDET, or quality-engineering focus, including 5+ years in a senior IC role (Staff/Principal/PMTS or equivalent) leading validation of complex systems software.
BS/MS/PhD in Computer Science, Computer Engineering, or related discipline (or equivalent demonstrated experience).
Expert-level Python for test automation and infrastructure; strong C++ for debugging and extending production code paths under test.
Deep, demonstrable validation experience in at least two of the following domains:
GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL)
Deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM)
HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric)
Linux kernel, GPU drivers, or accelerator firmware
Distributed systems and large-scale cluster software
System-level validation for server-class compute nodes — multi-GPU, multi-node, fabric-attached environments — including stress/stability, soak, fault-injection, and RAS testing.
Proven, hands-on experience working efficiently in an agentic AI engineering environment — daily, production use of LLM-based coding agents (e.g., Cursor, Claude Code, Copilot Workspace, Codex-class agents) and orchestration frameworks for real engineering work, with demonstrable productivity, quality, or coverage gains attributable to those workflows. Comfort designing prompts, tool/MCP integrations, evaluation harnesses, and guardrails for autonomous and semi-autonomous agents.
Hands-on experience defining and shipping release qualification programs for software consumed by hyperscalers, OEMs, or other Tier-1 customers.
Mastery of GitHub at scale for quality engineering — PR gating, GitHub Actions, self-hosted runners, required status checks, release tagging, and open-source contribution and triage norms.
Strong command of modern, agile software development practices — trunk-based development, CI/CD, shift-left testing, observability, feature flags, and incremental delivery — applied specifically to validation organizations.
Excellent written and verbal communication — able to author crisp test plans, qualification reports, RFCs, and post-mortems, and to influence development teams without authority.

Preferred Qualifications

Direct contributions to validation, CI, or test infrastructure for ROCm, PyTorch, LLVM, Triton, vLLM, or comparable upstream open-source projects.
Demonstrated leadership in agentic-AI adoption — built or rolled out agent-based workflows across an engineering team (e.g., autonomous test generation, AI-driven log/triage pipelines, multi-agent debug systems, MCP server design, retrieval-augmented engineering knowledge bases) with measurable outcomes.
Experience operating or validating large GPU clusters (256+ GPUs) — fabric bring-up, cluster health monitoring, and fleet-level diagnostics.
Familiarity with Training/Inference/HPC industry-standard benchmark methodologies and submissions.
Background in performance validation: roofline analysis, profiler tooling (rocprof, Omniperf, Nsight-class), and regression detection.
Experience with fault injection, RAS, telemetry, and long-haul stability programs for accelerator platforms.
Familiarity with hardware lab automation: BMC/IPMI/Redfish, PDU control, serial-console capture, automated re-imaging, and topology-aware test scheduling.
Prior experience standing up validation for pre-silicon / emulation / first-silicon bring-up of accelerators.

Why This Role

ROCm powers AI and HPC workloads on AMD Instinct GPUs at the largest scale in the industry. The quality of every ROCm release is felt across millions of GPUs in production — and the validation organization is what stands between "code complete" and "customer ready." As Principal MTS for ROCm Validation, you will define that bar, build the systems that enforce it, and personally lead the toughest qualification problems on AMD's most strategic platforms.

#Hybrid

AMD is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Benefits offered are described: AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD’s “Responsible AI Policy” is available here.

This posting is for an existing vacancy.