Senior Software SDET Test Development Engineer

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA is seeking a Senior Software SDET Test Development Engineer to join their platform SWQA team. The role involves developing and executing test plans for NVIDIA HGX/DGX/MGX platforms, including servers, OS, FW, and CUDA SW stack. Responsibilities include installing and testing systems, driving root cause analysis for failures, building automation frameworks, and managing bug lifecycles. The ideal candidate will have extensive experience in OS and server-level automation, CI/CD, DevOps, and troubleshooting in Linux environments, with good knowledge of AI tools, NLP, and LLM benchmarking.

What you'd actually do

  1. Develop and execute test plans for NVIDIA HGX/DGX/MGX platforms, covering servers, OS, FW, and the CUDA SW stack, based on design docs.
  2. Install and test various system OSes, server firmware, and SW stacks.
  3. Drive root cause analysis of reliability and validation test failures to identify the cause(s) and reach mitigation.
  4. Build, develop, and debug server- and OS-level automation frameworks (front end and back end) and tests.
  5. Review partner and supplier test results and prescribe additional reliability testing on components, servers, and packaging as needed.

Skills

Required

  • 5+ years proven experience
  • Proven experience in OS- and server-level automation, CI/CD processes, and DevOps using Python, shell, Ansible, Jenkins, C/C++, Java, and JavaScript
  • Strong server and Linux (Ubuntu, Red Hat, CentOS, SUSE, Fedora, etc.) troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments
  • Good knowledge of and hands-on experience in model testing, AI tools/frameworks (TensorFlow, PyTorch, Cursor, etc.), NLP, and LLM benchmarking
  • Experience using AI development tools for test plan creation, test case development, and test case automation

Nice to have

  • FW, BMC/OpenBMC, network protocols, internal/external enterprise storage devices, PCIe buses and devices, IO sub-devices, CPU and memory, ACPI, UEFI spec, Redfish
  • GitHub/GitLab/Gerrit, PXE, SLURM, Stack/Kubernetes/Docker
  • AI-related tools, LLMs, and NLP
  • Experience working with NVIDIA GPU hardware
  • Solid understanding of virtualization in Linux (KVM, Docker orchestrated with Kubernetes)
  • Parallel programming, ideally CUDA/OpenCL

What the JD emphasized

  • track record in developing AI tools and NLP
  • model testing, AI tools/frameworks (TensorFlow, PyTorch, Cursor, etc.), NLP, and LLM benchmarking
  • Experience using AI development tools for test plan creation, test case development, and test case automation

Other signals

  • AI tools and NLP
  • LLM benchmarking
  • AI development tools for test plan creation
  • test case development and test case automation