Senior Software QA Test Development Engineer - Diagnostics

NVIDIA · Semiconductors · Santa Clara, CA

This role is for a Senior Software QA Test Development Engineer focused on diagnostics on NVIDIA's platform SWQA team. Responsibilities include developing and executing test plans for servers, OS, firmware, and the CUDA SW stack, driving root cause analysis of failures, and building automation frameworks. The role requires experience in OS- and server-level automation, CI/CD, and DevOps, plus knowledge of AI tools and frameworks such as TensorFlow and PyTorch, NLP, and LLM benchmarking. Although the role involves testing AI-related components and using AI tools for testing, its core function is QA and test development, not direct AI model development or research.

What you'd actually do

  1. Develop and execute NVIDIA HGX/DGX/MGX platform test plans for servers, OS, FW, and the CUDA SW stack, working from design docs.
  2. Install and test various operating systems, server firmware, and SW stacks.
  3. Drive support for root cause analysis of reliability and validation test failures, identifying root cause(s) and achieving mitigation.
  4. Build, develop, and debug server- and OS-level automation frameworks (front end and back end) and tests.
  5. Review partner and supplier test results and prescribe additional reliability testing on components, servers, and packaging as needed.

Skills

Required

  • OS and server level automation
  • CI/CD process
  • DevOps experience
  • Python
  • Shell
  • Ansible
  • Jenkins
  • C/C++
  • Java
  • JavaScript
  • Server and Linux troubleshooting and debugging
  • Bare-metal and KVM/VMware/Hyper-V environment experience
  • Model testing
  • AI tools/frameworks (TensorFlow, PyTorch, Cursor)
  • NLP
  • LLM benchmarking
  • AI development tools for test plan creation, test case development, and test case automation

Nice to have

  • FW, BMC/OpenBMC
  • Network protocol
  • Internal/external enterprise storage devices
  • PCIe buses and devices
  • IO sub-devices
  • CPU and memory
  • ACPI
  • UEFI spec
  • Redfish
  • GitHub/GitLab/Gerrit
  • PXE
  • SLURM
  • Stack/Kubernetes/Docker
  • AI-related tools
  • LLM
  • NLP
  • NVIDIA GPU hardware
  • Virtualization in Linux (KVM, Docker orchestrated with Kubernetes)
  • Parallel programming
  • CUDA/OpenCL

What the JD emphasized

  • Proven years of experience in OS- and server-level automation, CI/CD processes, and DevOps using Python, Shell, Ansible, Jenkins, C/C++, Java, and JavaScript
  • Strong server and Linux (Ubuntu, Red Hat, CentOS, SUSE, Fedora, etc.) troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments
  • Good knowledge of and hands-on experience with model testing, AI tools/frameworks (TensorFlow, PyTorch, Cursor, etc.), NLP, and LLM benchmarking
  • Experience using AI development tools for test plan creation, test case development, and test case automation