Staff/Principal Engineer - AI/ML & System-Level Validation

AMD · Semiconductors · Hyderabad, India · Engineering

Staff/Principal Engineer role focused on validating ROCm software for AMD Instinct GPU platforms. It covers end-to-end validation architecture, release-qualification gates, system-level testing, and compute workload characterization (including LLM training and inference). The role involves architecting test infrastructure, championing agile quality engineering, leading debug efforts, and influencing roadmaps, with a strong emphasis on using agentic AI/ML tools for engineering productivity.

What you'd actually do

  1. Own the end-to-end validation architecture for ROCm — unit, integration, framework, workload, performance, stress, stability, scale-out, and system-level test layers — across multiple GPU generations and server platforms.
  2. Define release-qualification gates and exit criteria for ROCm software releases (functional coverage, performance regressions, stability hours, scale targets, RAS criteria) and drive the org to meet them.
  3. Lead system-level testing for server nodes — multi-GPU topologies, PCIe/Infinity Fabric/xGMI, BMC/IPMI, thermal/power, firmware interactions, and multi-node fabric (Ethernet/InfiniBand/UALink) bring-up and validation.
  4. Drive compute workload validation and characterization — LLM training and inference (PyTorch, vLLM, Triton, JAX), recommender systems, scientific HPC kernels, MLPerf-class benchmarks — establishing reproducible methodology, baselines, and regression tracking.
  5. Architect the test infrastructure — distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines.

Skills

Required

  • Python for test automation and infrastructure
  • C++ for debugging
  • GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL)
  • Deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM)
  • System-level validation for server-class compute nodes
  • Agentic AI engineering environment
  • LLM-based coding agents
  • Prompt design
  • Tool/MCP integrations
  • Evaluation harnesses
  • Guardrails for autonomous and semi-autonomous agents
  • Release qualification programs

Nice to have

  • HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric)
  • Linux kernel, GPU drivers, or accelerator firmware
  • Distributed systems and large-scale cluster software
  • GitHub Actions
  • Jenkins
  • CI fleets
  • Hardware lab orchestration
  • Flaky-test detection
  • Bisection automation
  • Self-service developer pre-submit pipelines
  • GitHub-based quality workflows
  • PR gating policy
  • Code-coverage standards
  • Bug-bash and triage cadences
  • Issue management

What the JD emphasized

  • ROCm software validation
  • release-qualification gates
  • system-level validation for server-class compute nodes
  • Proven, hands-on experience working efficiently in an agentic AI engineering environment

Other signals

  • ROCm software validation
  • LLM training and inference
  • release-qualification gates
  • system-level testing for server nodes
  • test infrastructure architecture