Senior Failure Analysis Engineer - Test Development

AMD · Semiconductors · Secaucus, NJ · Engineering

This role develops advanced test methods for GPU accelerator platforms, specifically targeting elusive failures. It involves designing custom execution flows built on stress-based scenarios, VPOD environments, and AI/ML workloads. The engineer will build test content that improves repeatability, shortens debug cycles, and increases confidence in root-cause findings. A key aspect is shaping intelligent test systems that use internal engineering knowledge and live model inference to guide execution decisions in real time, translating vague symptoms into testable conditions.

What you'd actually do

  1. Architect targeted test methods for hard-to-capture platform behaviors across GPU, server, and rack-scale environments.
  2. Invent new workload patterns, sequencing approaches, and stress combinations that reveal conditions not covered by conventional diagnostics.
  3. Build and maintain VPOD-based environments that support scalable experimentation, long-duration execution, and controlled reproduction studies.
  4. Use inference and training activity as system stimuli to probe platform limits, timing sensitivities, and failure-prone operating regions.
  5. Develop automation, scripting, and orchestration tools to launch workloads, monitor execution, collect logs, and analyze results at scale across Windows and Linux environments.

Skills

Required

  • Python
  • shell scripting
  • automation development
  • workload launch
  • orchestration
  • telemetry capture
  • post-run analysis
  • system data interpretation
  • debug artifact analysis

Nice to have

  • GPU and server platform behavior
  • system stress interactions
  • concurrency effects
  • stability characterization
  • VPOD environments
  • inference and training environments
  • diagnostics
  • firmware interactions
  • drivers
  • hardware/software boundaries
  • GPU data center infrastructure
  • AI/ML technologies
  • non-standard workload development

What the JD emphasized

  • hard-to-capture platform behaviors
  • elusive failures
  • AI/ML workloads
  • AI-enabled workflows
  • live model inference
  • AI-enabled test systems
  • intermittent, low-occurrence, or otherwise difficult-to-observe failure modes
  • inference and training environments, including their use as controllable system stressors
  • AI-enabled test systems that incorporate internal engineering knowledge and support real-time inference during execution
