Manager, Solutions Architecture - Data Center Specialists

NVIDIA · Semiconductors · Santa Clara, CA +2 · Remote

Manager for a team of infrastructure experts focused on delivering NVIDIA-powered AI Factories, advising partners on large-scale AI/HPC projects, and understanding AI workloads in relation to data center infrastructure.

What you'd actually do

  1. Managing and developing a group of infrastructure and HPC specialists;
  2. Providing guidance and support to partners, helping them successfully deploy and bring up AI Factories;
  3. Helping our partners employ our best practices and reference architectures and taking your knowledge out to the field;
  4. Raising and providing timely advance alerts of critical customer issues that need further focus.

Skills

Required

  • BS/MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or other Engineering fields.
  • 8+ years of overall work or research experience with Python, C++, or other software development.
  • 4+ years of experience leading a team.
  • Track record of medium- to large-scale AI training and understanding of key libraries used for NLP/LLM/VLA training (NeMo Framework, DeepSpeed, etc.).
  • Experience with integration and deployment of software products in production enterprise environments, and microservices software architecture.
  • Solid understanding of data center infrastructure: servers, storage, networking, cabling, power, cooling, and physical deployment workflows.
  • Experience with software microservices and with the incorporation and delivery of software in production environments.
  • Technical leadership and strong understanding of NVIDIA technologies, and success in working with customers.
  • Excellent verbal and written communication and technical presentation skills in English.

Nice to have

  • Understanding of HPC systems: data center design, high-speed interconnects (InfiniBand), cluster storage, and scheduling-related design and/or management experience.
  • Strong coding and debugging skills, and demonstrated expertise in one or more of the following areas: Machine Learning, Deep Learning, Slurm, Docker/Kubernetes, Singularity, MPI, MLOps, LLMOps, Ansible, Terraform, and other high-performance AI cluster solutions.
  • Hands-on experience with HPC clusters, InfiniBand, GPU infrastructure, or hyperscale data center technologies.
  • Experience in AI infrastructure deployment, professional services, or tech vendor post-sales delivery.

What the JD emphasized

  • Track record of medium to large scale AI training
  • Experience with integration and deployment of software products in production enterprise environments
  • Solid understanding of data center infrastructure
  • Experience with software microservices and with the incorporation and delivery of software in production environments
  • Technical leadership and strong understanding of NVIDIA technologies
  • AI infrastructure deployment

Other signals

  • AI Factories
  • large-scale HPC and AI infrastructure
  • AI adoption to the enterprise
  • AI/HPC projects
  • AI workload
  • AI training
  • AI infrastructure deployment