Solutions Architect, AI Cloud Partner Performance

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA Solutions Architect focused on enabling cloud partners to achieve elite performance and reliability for AI workloads, particularly LLM training and inference, by adopting reference architectures and optimizing GPU clusters.

What you'd actually do

  1. Work closely with NVIDIA Cloud Partners (NCPs) as a compute and networking performance specialist, ensuring they meet high performance standards and accomplish their business goals.
  2. Enable NCPs to achieve Exemplar Cloud status through demonstration of performance capabilities with respect to reference benchmarks.
  3. Accelerate NCP onboarding time by resolving deviations from reference performance targets.
  4. Improve NVIDIA Cloud Partner cluster manageability and reliability by advising customers on how to apply available solutions.
  5. Scale knowledge, reach, and opportunities by educating internal teams and communities on NVIDIA Reference Architectures and the Exemplar Cloud program.

Skills

Required

  • BS, MS, or Ph.D. degree in Engineering, Mathematics, Physics, Computer Science, or Data Science (or equivalent experience)
  • 5+ years of proven experience with one or more Cloud Service Providers (AWS, Azure, GCP, or OCI), NCPs (CoreWeave, Lambda Labs, Crusoe, etc.), and cloud-native architectures and software.
  • Experience leading joint debugging and optimization sessions with partners, driving the resolution of distributed training bottlenecks and fabric anomalies.
  • Expertise in performance tuning of RDMA-enabled GPU clusters, including running performance benchmarks and diagnosing performance issues with compute and network tracing tools.
  • Strong coding and outstanding debugging skills.
  • Proficiency in LLM training and inference workloads, Slurm, Kubernetes, MPI, NCCL.
  • Experience with Linux configuration, management, monitoring, and system administration, with proficiency in problem-solving in both bare-metal and virtual environments
  • Understanding of networking fundamentals (e.g., routers, firewalls, load balancers, DNS, VPNs) for high-performance infrastructure

Nice to have

  • Ability to perform root cause analysis on distributed training failures using Nsight Systems and NCCL-tests, applying a detailed divide-and-conquer approach to isolate network/fabric issues
  • Experience running LLM benchmarks and NCCL-tests, and automating RDMA diagnostic tools.
  • Background in deploying and configuring observability tooling, including Grafana, Prometheus, W&B, Nagios, and Zabbix
  • Ability to take ownership when resolving cluster downtime or degraded performance with customers

What the JD emphasized

  • LLM training and inference workloads
  • performance tuning
  • distributed training bottlenecks
  • fabric anomalies
  • debugging

Other signals

  • partner enablement
  • performance optimization
  • reference architectures
  • LLM training and inference