AI Factory CPU-Focused Solutions Architect

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

This role focuses on designing, building, and maintaining large-scale HPC and AI infrastructure, specifically CPU-based solutions within the NVIDIA AI Factory. The Solutions Architect will help customers adopt end-to-end AI solutions, operationalize large compute resources, and overcome adoption barriers. The role requires a deep technical understanding of NVIDIA's stacks and AI workflows.

What you'd actually do

  1. Our day-to-day work involves helping our partners be successful in their adoption of end-to-end AI solutions using NVIDIA's compute, networking, and software stacks.
  2. For this particular role, that means having a deep technical understanding of NVIDIA Reference Architectures, and using that understanding to help customers adopt our CPU-based solutions as part of the overall NVIDIA AI Factory.
  3. This is a multi-faceted role that requires being comfortable working not just with hardware and software elements, but also with the larger AI workflow and the operationalization of large-scale compute resources.
  4. We succeed when we help our customers overcome barriers to adopting our best known methods.
  5. As the technical leader for the CPU components within the NVIDIA AI Factory, you will play an instrumental role in driving that success.

Skills

Required

  • Experience with defining, deploying, and testing large-scale reference architectures for High Performance Computing and AI.
  • A track record of defining and using MLOps and AI workflow tools and processes.
  • 6 or more years of hands-on expertise with modern data center architectures and the interaction between CPUs, GPUs, and networking.
  • Strong foundational expertise and a BS, MS, or equivalent experience in Engineering.
  • Strong analytical and problem-solving skills.
  • Ability to clearly articulate technical knowledge to others.
  • Ability to multitask efficiently in a multifaceted environment.
  • Experience organizing, presenting, and discussing technical material with groups spanning a range of technical backgrounds.
  • Flexibility to adapt in fluid situations, especially with partners or customers.
  • Comfortable with occasional travel to customer sites.

Nice to have

  • Hands-on experience with Arm-based server processors and the Arm software ecosystem.
  • Proficiency with tooling, automation, and performance testing for large-scale clusters, preferably using AI tools.
  • Deep understanding of Agentic AI and inference workflows.
  • Experience building, using, and explaining reinforcement learning.
  • Willingness and ability to learn quickly as we address sophisticated problems, and an understanding of how all elements of the AI Factory interact with each other.

What the JD emphasized

  • large-scale HPC and AI infrastructure
  • deploy and operationalize AI solutions at scale
  • NVIDIA Reference Architectures
  • CPU-based solutions
  • larger AI workflow
  • operationalization of large scale compute resources
  • defining, deploying, and testing large scale reference architectures for High Performance Computing and AI
  • defining and using MLOps and AI workflow tools and processes
  • modern data center architectures
  • interaction between CPUs, GPUs, and networking
  • Agentic AI and inference workflows

Other signals

  • designing, building, and maintaining large-scale HPC and AI infrastructure
  • technical leader for the CPU components within the NVIDIA AI Factory