Solutions Architect – AI Factory

NVIDIA · Semiconductors · Germany +4 · Remote

A Solutions Architect role focused on designing, building, and operationalizing large-scale AI factories and GenAI/Agentic AI solutions for enterprise customers on NVIDIA's technology stack. The work is hands-on across compute, networking, software, and cluster management tools.

What you'd actually do

  1. Guiding customers in adopting NVIDIA's compute, networking, and software stacks to deliver end-to-end GenAI and Agentic AI solutions.
  2. Using cloud-native methodologies, low-latency networks, and accelerated compute to help build modern AI factories.
  3. Delivering demos, assisting with proofs of concept, or writing papers and developer blogs.
  4. Collaborating with executives and engineering teams to solve complex problems and bring NVIDIA's premier technologies to life in the cloud and in the datacenter.
  5. Solving problems that nobody else has solved yet.

Skills

Required

  • MS or PhD in Engineering, Computer Science, or a related field (or equivalent experience).
  • Established track record of working with AI and HPC clusters, both on-premises and cloud-based.
  • 8+ years of proven experience with cluster management and related tools, including Docker containers, Slurm, Kubernetes, and Ansible.
  • Hands-on experience with datacenter MEP (mechanical, electrical, and plumbing), networking, storage, and cluster configuration and debugging.
  • Strong analytical and problem-solving skills, along with the ability to explain what you know clearly to others.
  • Ability to multitask efficiently in a dynamic environment.

Nice to have

  • Strong coding and debugging skills, including experience with CUDA, Python, C/C++, Bash, AI frameworks and Linux utilities.
  • Demonstrated expertise through projects or open-source contributions involving GPU workloads, Kubernetes, InfiniBand, Ethernet, or other areas related to high-performance clusters and hybrid cloud solutions.
  • Hands-on experience with NVIDIA Enterprise software products, Base Command Manager, Run:ai, and NVIDIA NIMs.

What the JD emphasized

  • large-scale AI factories
  • GenAI and Agentic AI solutions
  • modern AI factories
  • AI and HPC clusters
  • GPU workloads

Other signals

  • designing, building, and maintaining large-scale AI factories
  • deploy and operationalize AI solutions at scale
  • end-to-end GenAI and Agentic AI solutions