Senior Software Development Engineer in Test - Datacenter Server OS

NVIDIA · Semiconductors · Santa Clara, CA

Senior Software Development Engineer in Test for Datacenter Server OS at NVIDIA. Responsibilities include developing and executing test plans for NVIDIA HGX/DGX/MGX platforms, installing and testing OS/firmware/SW stacks, driving root cause analysis for failures, and building automation frameworks. Requires strong Linux, OS, and server-level automation experience, CI/CD, DevOps, and knowledge of AI tools/frameworks (TensorFlow, PyTorch), NLP, and LLM benchmarking. Experience with AI development tools for test automation is also required.

What you'd actually do

  1. Develop and execute test plans for NVIDIA HGX/DGX/MGX platforms, covering servers, OS, FW, and the CUDA SW stack, based on design docs.
  2. Install and test various operating systems, server firmware, and SW stacks.
  3. Drive root cause analysis of reliability and validation test failures, identifying the root cause(s) and achieving mitigation.
  4. Build, develop, and debug front-end and back-end automation frameworks and tests at the server and OS level.
  5. Review partner and supplier test results and prescribe additional reliability testing on components, servers, and packaging as needed.

Skills

Required

  • OS and server level automation
  • CI/CD process
  • DevOps experience
  • Python
  • Shell
  • Ansible
  • Jenkins
  • C/C++
  • Java
  • JavaScript
  • Linux troubleshooting and debugging
  • model testing
  • AI tools/frameworks (TensorFlow, PyTorch, Cursor, etc.)
  • NLP
  • LLM benchmarking
  • AI development tools for test plan creation
  • test case development
  • test case automation

Nice to have

  • enterprise server integration
  • reliability testing with various telemetry sources
  • scale-out clusters
  • test plan development
  • track record in developing AI tools and NLP
  • FW
  • BMC/OpenBMC
  • Network protocols
  • internal/external enterprise storage devices
  • PCIe buses and devices
  • IO sub-devices
  • CPU and memory
  • ACPI
  • UEFI spec
  • Redfish
  • GitHub/GitLab/Gerrit
  • PXE
  • SLURM
  • Stack/Kubernetes/Docker
  • AI-related tools
  • LLM and NLP
  • NVIDIA GPU hardware
  • virtualization in Linux (KVM, Docker orchestrated with Kubernetes)
  • parallel programming, ideally CUDA/OpenCL

What the JD emphasized

  • AI tools for test plan creation
  • NLP and LLM benchmarking
  • AI tools and NLP