What you'd actually do

Architect targeted test methods for hard-to-capture platform behaviors across GPU, server, and rack-scale environments.

Invent new workload patterns, sequencing approaches, and stress combinations that reveal conditions not covered by conventional diagnostics.

Build and maintain VPOD-based environments that support scalable experimentation, long-duration execution, and controlled reproduction studies.

Use inference and training activity as system stimuli to probe platform limits, timing sensitivities, and failure-prone operating regions.

Develop automation, scripting, and orchestration tools to launch workloads, monitor execution, collect logs, and analyze results at scale across Windows and Linux environments.

Skills

Required

Python
shell scripting
automation development
workload launch
orchestration
telemetry capture
post-run analysis
system data interpretation
debug artifact analysis

Nice to have

GPU and server platform behavior
system stress interactions
concurrency effects
stability characterization
VPOD environments
inference and training environments
diagnostics
firmware interactions
drivers
hardware/software boundaries
GPU data center infrastructure
AI/ML technologies
non-standard workload development

What the JD emphasized

hard-to-capture platform behaviors

elusive failures

AI/ML workloads

AI-enabled workflows

live model inference

AI-enabled test systems

intermittent, low-occurrence, or otherwise difficult-to-observe failure modes

inference and training environments, including their use as controllable system stressors

AI-enabled test systems that incorporate internal engineering knowledge and support real-time inference during execution

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. **Together, we advance your career.**

**THE ROLE:**

The Quality Engineering team is looking for an experienced Senior Failure Analysis Engineer - Test Development to create advanced test methods that surface elusive failures in GPU accelerator platforms. This role is centered on designing custom execution flows that go beyond standard validation, using stress-based scenarios, VPOD environments, AI/ML workloads, and adaptive test logic to make hard-to-capture issues observable and actionable. The engineer will expand failure analysis capability across lab, factory, and customer-return cases by building test content that improves repeatability, shortens debug cycles, and increases confidence in root cause findings. They will also help shape intelligent test systems that use internal engineering knowledge and live model inference to guide execution decisions in real time. Working across FA, validation, firmware, diagnostics, and data teams, this person will help convert unclear symptoms into testable conditions that accelerate resolution.

**THE PERSON:**

The ideal candidate is inventive, methodical, and technically versatile, with a strong instinct for designing experiments that reveal behavior hidden under normal test conditions. They are comfortable navigating hardware, firmware, software, and system-level interactions, and know how to choose the right levers—environment, timing, workload composition, instrumentation, or automation—to provoke meaningful behavior. They are effective in VPOD-based test environments, capable of using model-driven compute activity as part of system stimulation, and confident building AI-enabled workflows that draw from team-specific knowledge during execution. Just as importantly, they can turn messy observations into disciplined experiments, communicate clearly across teams, and document approaches in a way others can reuse.

**KEY RESPONSIBILITIES:**

Architect targeted test methods for hard-to-capture platform behaviors across GPU, server, and rack-scale environments.
Invent new workload patterns, sequencing approaches, and stress combinations that reveal conditions not covered by conventional diagnostics.
Build and maintain VPOD-based environments that support scalable experimentation, long-duration execution, and controlled reproduction studies.
Use inference and training activity as system stimuli to probe platform limits, timing sensitivities, and failure-prone operating regions.
Develop automation, scripting, and orchestration tools to launch workloads, monitor execution, collect logs, and analyze results at scale across Windows and Linux environments.
Interpret telemetry, logs, and observed signatures to refine experiments, isolate trigger conditions, and improve confidence in reproduced behavior.
Create AI-enabled execution flows that use internal FA knowledge and live inference to guide test branching, detect emerging patterns, and support faster triage decisions.
Partner closely with FA, validation, diagnostics, firmware, and manufacturing teams to translate vague symptoms or sporadic field issues into targeted and repeatable test content.
Document workload intent, test methods, reproduction conditions, and findings clearly so they can be reused across teams and incorporated into future FA workflows.
Drive continuous improvement of test development methods, workload libraries, and failure reproduction strategies to expand FA coverage and reduce time to root cause.

**PREFERRED EXPERIENCE:**

Proven track record of developing custom test methodologies for intermittent, low-occurrence, or otherwise difficult-to-observe failure modes.
Strong foundation in GPU and server platform behavior, including system stress interactions, concurrency effects, and stability characterization.
Demonstrated ability to build, run, and optimize VPOD environments and related infrastructure for large-scale FA or validation test execution.
Hands-on familiarity with inference and training environments, including their use as controllable system stressors in platform investigation.
Proficient in Python, shell scripting, and automation development for workload launch, orchestration, telemetry capture, and post-run analysis.
Ability to interpret system data and debug artifacts to uncover meaningful signals and guide the next experimental step.
Familiarity with diagnostics, firmware interactions, drivers, and hardware/software boundaries that influence failure behavior under stress workloads.
Experience building AI-enabled test systems that incorporate internal engineering knowledge and support real-time inference during execution.
Strong communication, documentation, collaboration, and presentation skills, with the ability to explain complex reproduction strategies and findings across technical teams.
Experience with GPU data center infrastructure, AI/ML technologies, and non-standard workload development is a strong plus.

**ACADEMIC CREDENTIALS:**

Bachelor’s degree in Electrical Engineering, Computer Engineering, Computer Science, or a related field.

**LOCATION:**

Secaucus, NJ

**This role is not eligible for visa sponsorship.**

_Benefits offered are described:_ AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD’s “Responsible AI Policy” is available here.

This posting is for an existing vacancy.