Staff Systems Software Engineer - Server

NVIDIA · Semiconductors · Taipei, Taiwan +1

NVIDIA is seeking a Staff Systems Software Engineer to design, implement, and integrate GPU diagnostics for its next-generation GPU products. The role involves developing tests, integrating them into manufacturing and datacenter workflows, debugging issues across HW/FW/SW boundaries, and analyzing data to improve diagnostics. It requires strong C/C++, Python, and Linux system software experience, along with solid debugging skills and a focus on server platforms.

What you'd actually do

  1. Implement and enhance GPU diagnostics covering power, thermal, memory, PCIe, NVLink, and system‑level checks on boards, servers, and racks.
  2. Develop stress and validation tests that exercise GPU subsystems and platform components; add clear pass/fail criteria, telemetry, and error codes suitable for automation and failure attribution.
  3. Contribute to the integration of diagnostics into L6/L10/L11 factory flows and datacenter workflows, working with senior engineers to define coverage, runtime, and sequencing.
  4. Execute and monitor automated regression tests and pipelines on top of orchestration systems.
  5. Debug issues in cooperation with hardware, firmware, and other teams (Ops/TE/AE, etc.); root‑cause problems that span HW/FW/SW boundaries.

Skills

Required

  • C/C++
  • Python
  • shell scripting
  • Linux system software development
  • x86 and/or ARM server architecture
  • GPU architecture concepts
  • debugging
  • problem-solving
  • gdb
  • perf
  • tracing frameworks
  • vendor debug utilities
  • technical leadership
  • communication skills
  • collaborative mindset

Nice to have

  • board bring-up tools
  • diagnostics
  • drivers
  • firmware and hardware registers
  • GPU accelerators integration
  • system logs interpretation
  • system/block diagrams and schematics interpretation
  • diagnostic tools in factory or datacenter environments
  • defining or improving diagnostic flows or RMA qualification flows
  • writing clear, high-coverage test plans and test cases
  • AI-assisted tools for log triage, code assistance, data analysis, or test generation

What the JD emphasized

  • diagnostics
  • system software
  • validation
  • debug
  • server platforms
  • GPU diagnostics
  • system-level checks
  • stress and validation tests
  • factory flows
  • datacenter workflows
  • automated regression tests
  • root-cause problems
  • logging and reporting
  • factory and field data
  • troubleshooting guides
  • platform validation
  • low‑level system software
  • server architecture
  • GPU architecture
  • system logs
  • system/block diagrams
  • schematics
  • debugging and problem-solving skills
  • technical leadership
  • driving projects
  • investigations
  • critical bugs to closure
  • cross-functional teams
  • communication skills
  • collaborative mindset
  • globally distributed teams
  • diagnostic tools
  • factory or datacenter environments
  • diagnostic flows
  • RMA qualification flows
  • complex hardware products
  • test plans
  • test cases
  • complex HW/SW systems
  • diagnostics development
  • debugging