Senior System Software Engineer, Firmware

NVIDIA · Semiconductors · Santa Clara, CA

This role is for a Senior System Software Engineer focused on firmware and system bring-up for datacenter applications, particularly in the context of GPU computing and AI hardware. The engineer will work on scheduling, resource management, and infrastructure for large-scale compute environments, collaborating with various engineering teams. While the role supports AI advancements, its core focus is on the underlying system infrastructure and firmware, not direct AI model development or research.

What you'd actually do

  1. Provide engineering solutions that enable large-scale scheduling and resource management for performance on GPU computing products and software stacks; maintain technical relationships with internal and external engineering teams; and assist systems and machine learning/deep learning engineers in building creative solutions based on NVIDIA technology.
  2. Serve as an internal reference within the NVIDIA technical community for scheduling, IO, and other datacenter and large-scale GPU-accelerated system solutions.

Skills

Required

  • accelerated computing for datacenter/HPC solutions
  • OS and server level automation
  • CI/CD process
  • DevOps
  • Python
  • Shell
  • Ansible
  • Jenkins
  • server and Linux troubleshooting and debugging
  • bare-metal/KVM/K8s environment
  • data center infrastructure
  • bare metal provisioning
  • testing
  • bring-up
  • SRE principles
  • observability
  • SLOs
  • logging
  • firmware (FW)
  • BMC/OpenBMC
  • network protocols
  • enterprise storage devices
  • PCIe buses and devices
  • IO sub-devices
  • CPU and memory
  • ACPI
  • UEFI
  • Redfish
  • communication skills
  • multitask effectively
  • analytical skills
  • BS (or equivalent experience) in Engineering, Mathematics, Physics, or Computer Science

Nice to have

  • Host management systems (DHCP, Redfish, UEFI)
  • host security services such as TPM, TXT, and Secure Boot
  • telemetry catalog
  • observability stack
  • container technology
  • software defined network
  • MS or PhD

What the JD emphasized

  • 6+ years of experience in accelerated computing for datacenter/HPC solutions
  • Strong server and Linux (Ubuntu, Red Hat, CentOS, SUSE, Fedora, etc.) troubleshooting and debugging experience in a bare-metal/KVM/K8s environment
  • Experience working with data center infrastructure for bare-metal provisioning, testing, and bring-up
  • Strong experience with firmware (FW), BMC/OpenBMC, network protocols, internal/external enterprise storage devices, PCIe buses and devices, IO sub-devices, CPU and memory, ACPI, UEFI, and Redfish