Solutions Architect, Infrastructure

NVIDIA · Semiconductors · Redmond, WA +2

This role focuses on the infrastructure and deployment of NVIDIA's Data Center GPUs and networking platforms for large customers, bridging early platform readiness with cloud engineering and customer adoption. It involves hands-on infrastructure expertise, multi-functional leadership, and problem-solving across hardware, networking, and system software.

What you'd actually do

  1. Lead end‑to‑end execution for Hyperscaler customers to rapidly bring NVIDIA Data Center GPU and networking platforms to market at scale.
  2. Drive strategic partnership and alignment with Product teams to understand roadmap intent, co‑define critical metrics, and ensure unified direction across technical, sales, and leadership organizations.
  3. Influence without authority across Product, Engineering, Sales, Operations, and CSP customers, driving clarity and alignment and unblocking paths to scale‑up.
  4. Analyze deployment and performance data, identifying product health trends, system bottlenecks, and operational risks.
  5. Solve challenging technical problems involving GPUs, networking, drivers, containers, firmware, and distributed system interactions.

Skills

Required

  • Solutions Architecture
  • Infrastructure Engineering
  • bring-up and validation of large-scale NVIDIA GPU platforms
  • multi-GPU and multi-node architectures
  • high-performance networking technologies (e.g., RDMA, congestion control, high-bandwidth interconnects)
  • Linux systems tools
  • server hardware architecture
  • BMC/IPMI/Redfish
  • Linux fundamentals across drivers, kernel subsystems, cgroups, containers, and node‑level performance analysis

Nice to have

  • multi-functional leadership
  • early platform readiness
  • cloud engineering teams
  • product strategy
  • large-scale customer deployments
  • NVIDIA technologies
  • worldwide cloud hosting providers
  • large enterprise environments
  • collaboration across Product, Engineering, Sales, Operations, and CSP customers
  • executive-level communication
  • future improvements in platform design, validation, and operational workflows
  • CUDA
  • NCCL
  • NVSwitch/NVLink
  • driver behavior
  • performance tuning
  • dmesg
  • journalctl
  • lspci
  • numactl
  • ethtool
  • iostat
  • perf
  • nvidia-smi
  • top/htop
  • ipmitool
  • container‑level tooling
  • PCIe topologies
  • system firmware
  • NUMA
  • BIOS/UEFI configuration
  • power/thermal envelopes
  • memory/subsystem behavior
  • remote management
  • hardware health monitoring
  • out‑of‑band debugging
  • cluster, node, accelerator, network, and application layers
  • Compute and networking infrastructure
  • Instance types
  • networking primitives
  • high‑performance communication paths
  • Hyperscalers
  • Cloud Service Providers
  • multi-team infrastructure challenges
  • customer groups
  • GPU or infrastructure products
  • pilot to high‑volume deployment
  • large data center environments
  • modern deep learning
  • LLM architectures
  • distributed training/inference challenges at scale

What the JD emphasized

  • bring-up and validation of large-scale NVIDIA GPU platforms
  • high-performance networking technologies
  • NVIDIA system software stacks
  • Linux systems tools
  • server hardware architecture
  • BMC/IPMI/Redfish
  • Strong Linux fundamentals
  • identify performance bottlenecks
  • taking GPU or infrastructure products from pilot to high‑volume deployment