Senior Engineering Manager - Compute Server Bring-Up

NVIDIA · Semiconductors · Santa Clara, CA

Senior Engineering Manager to lead the Compute Server Bring-Up team, responsible for the bring-up, integration, validation, and troubleshooting of compute tray platforms for GPU racks in data centers. This role involves leading a team of engineers and collaborating with cross-functional teams to ensure server functionality before mass deployment.

What you'd actually do

  1. Own Initial Power-On and Board Bring-Up: Lead the initial power-on and functional validation of compute trays (CPU, GPU, NIC, storage including NVMe, cooling, etc.) internally and with customers. Ensure all functional requirements are met.
  2. Form and lead a virtual team across NVIDIA software and firmware teams to ensure subject matter experts are available as needed throughout bring-up. Report regularly on bring-up status to provide visibility and ensure teams across the company are fully engaged to help.
  3. Oversee flashing, updating, and validation of firmware for all server components per the defined architecture. Ensure appropriate boundary, stress, and regression testing is done, and confirm that telemetry, logging, and hardware management features work as required. Document pain points, bring-up failures, and recovery flows, and provide actionable feedback to hardware, firmware, and software teams. Ensure usability, firmware/BIOS update coverage, and error reporting are sufficient for reliable customer installation and operation.
  4. Debug, Issue Resolution & Customer Support: Lead root cause analysis and resolution of bring-up failures. Collaborate with partners, ODMs, and customers for technical support.
  5. Product Ownership: Drive product life cycles with QA teams, ensuring robust bring-up, productization, and delivery.

Skills

Required

  • Systems/platform software team management
  • Server bring-up
  • Firmware development
  • Data center solutions
  • Matrix environment leadership
  • Virtual team leadership
  • Compute tray designs
  • Firmware enablement
  • System-level architecture
  • Scalable server product delivery
  • Hardware, firmware, manufacturing, diagnostics, and QA collaboration
  • SCM (Git, Perforce)
  • Project management tools (Jira)
  • x86/ARM system architecture
  • C/C++
  • Python

Nice to have

  • Experience leading bring-up for sophisticated compute architectures like GB200 NVL72

What the JD emphasized

  • 5+ years of relevant experience managing systems/platform software teams, ideally in server bring-up, firmware development, or data center solutions
  • BS, MS, or PhD in EE/CS or related field (or equivalent experience) with 12+ years of overall experience
  • Proven track record of delivering scalable server products and solutions for large-scale data centers
  • Excellent written and oral communication skills
  • Hands-on experience with x86/ARM system architecture and coding (C/C++, Python)
  • Proven excellence in server architecture and in collaborating across teams to deliver server products against defined key performance indicators (KPIs)