HPC Solution Architect – AI Infrastructure (S2S)

Designs and drives deployment of GPU-accelerated AI factories and high-performance computing infrastructure, partnering with AI specialists and ecosystem partners to shape end-to-end solutions for clients. Focuses on technical solution strategy, architecture, and pre-sales for private AI assets.

What you'd actually do

  1. Leading architecture for pursuits and active opportunities, including discovery, requirements, constraints, and target-state design
  2. Defining reference architectures for on-premises, cloud, and hybrid GPU platforms across compute, network, storage, security, software, and operations
  3. Driving architecture trade-offs and decisions across performance, scalability, reliability, locality, total cost of ownership, time-to-value, and risk
  4. Owning the technical solution strategy in proposals and RFPs, including architecture narrative, assumptions, dependencies, sizing guidance, and delivery approach
  5. Facilitating client workshops and technical reviews, and translating engineering detail into executive-ready communications
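The sizing guidance this role owns usually starts from simple capacity arithmetic before any vendor tooling gets involved. A minimal sketch of that back-of-envelope math (node size, GPU memory, and bytes-per-parameter figures below are illustrative assumptions, not prescriptions):

```python
import math

def nodes_needed(required_gpus: int, gpus_per_node: int = 8) -> int:
    """Round a raw GPU requirement up to whole nodes (e.g. 8-GPU HGX-class servers)."""
    return math.ceil(required_gpus / gpus_per_node)

def training_gpu_estimate(model_params_b: float, gpu_mem_gb: int = 80,
                          bytes_per_param: int = 16) -> int:
    """Very rough minimum GPU count to hold model weights plus optimizer state.

    bytes_per_param=16 approximates mixed-precision Adam (weights, gradients,
    optimizer moments); real sizing must also account for activations,
    parallelism strategy, and framework overhead.
    """
    total_gb = model_params_b * bytes_per_param  # params (billions) * bytes/param -> GB
    return math.ceil(total_gb / gpu_mem_gb)
```

For example, a 70B-parameter model under these assumptions needs roughly 14 × 80 GB GPUs, which rounds up to two 8-GPU nodes; the real answer then gets stress-tested against the performance, TCO, and time-to-value trade-offs listed above.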

Skills

Required

  • 10+ years of experience in infrastructure architecture or engineering for large-scale platforms, including design, implementation, operations, and optimization
  • 4+ years designing or delivering GPU-accelerated platforms for AI, ML, or high-performance computing
  • 3+ years Linux system administration in production environments
  • 3+ years designing or operating distributed compute clusters for AI/HPC in hybrid cloud setups, including multi-GPU topologies, partitioning, scheduler integration, and scalability for edge-to-cloud workloads
  • 2+ years with high-performance networking or storage for AI/HPC
  • 2+ years building containerized platforms using Kubernetes or Red Hat OpenShift, including GPU operators/drivers, the NVIDIA container runtime, and cluster lifecycle automation
  • 2+ years automating infrastructure as code (IaC) with tools like Terraform and Ansible
  • At least 2 end-to-end deployments of reference architectures in the cloud or on-prem, including variants with security controls, network segmentation, operational runbooks, and validation testing
  • Experience in pre-sales or sales engineering, including discovery, solution demonstrations, and proposal/RFP contributions

Nice to have

  • 2+ years implementing AI/HPC cluster scheduling (Slurm and Kubernetes), including multi-tenant queues, quotas, and GPU-aware policies
  • 2+ years supporting generative AI infrastructure patterns, including multi-node distributed training
  • Experience with AI agents and frameworks
  • Experience with high-throughput storage for AI/HPC
  • Experience executing NVIDIA co-sell motions with OEMs (Dell, HPE, Lenovo), CSPs (AWS, Azure, Google Cloud), or independent software vendors (Run:ai, OpenShift, Weights & Biases)
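The multi-tenant queues and quotas mentioned above (in the spirit of Slurm accounts or Kubernetes ResourceQuota objects) boil down to an admission check against a per-tenant ceiling. A toy sketch, with tenant names and limits purely hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TenantQuota:
    """Per-tenant GPU quota, loosely modeled on Slurm account limits
    or a Kubernetes ResourceQuota on an extended resource."""
    gpu_limit: int
    gpu_in_use: int = 0

    def admit(self, requested_gpus: int) -> bool:
        """Admit a job only if it fits under the tenant's GPU ceiling."""
        if self.gpu_in_use + requested_gpus > self.gpu_limit:
            return False
        self.gpu_in_use += requested_gpus
        return True

# Hypothetical tenants sharing one cluster.
quotas = {"research": TenantQuota(gpu_limit=16), "prod": TenantQuota(gpu_limit=32)}
```

Real schedulers layer preemption, fair-share, and GPU-aware placement (topology, MIG partitions) on top of this basic check, but the quota model is the same.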

What the JD emphasized

  • GPU-accelerated platforms for AI
  • distributed compute clusters for AI/HPC
  • high-performance networking or storage for AI/HPC
  • generative AI infrastructure patterns
  • AI agents and frameworks

Other signals

  • GPU-accelerated AI factories
  • high-performance computing infrastructure
  • private AI assets
  • large-scale, private AI and data platforms