Senior Solutions Architect - AI Factory Deployment

NVIDIA · Semiconductors · CA +3 · Remote

Senior Solutions Architect focused on deploying and validating AI factories, specifically running and debugging AI/LLM workloads on GPU clusters. Responsibilities include setting up environments, executing benchmarks, resolving performance issues, building observability, and recommending optimizations.

What you'd actually do

  1. Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
  2. Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
  3. Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.
  4. Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health.
  5. Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.

Skills

Required

  • Managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML environments
  • Running AI/ML workloads on multi-GPU and/or multi-node clusters
  • NCCL
  • Collective communication patterns (AllReduce, AllToAll)
  • LLM training and/or inference workflows
  • PyTorch or TensorFlow
  • Python
  • Shell/Bash scripting
  • Benchmarking
  • Observability data (metrics, logs, dashboards)

Nice to have

  • AI factory or large-scale AI infrastructure build, deployment, or operations
  • HPC performance engineering
  • SRE
  • Systems performance analysis for GPU-accelerated environments
  • Observability stacks (metrics/monitoring, logging, tracing systems)
  • Automation and CI-style pipelines for running and validating benchmarks at scale

What the JD emphasized

  • Hands-on experience running AI/ML workloads on multi-GPU and/or multi-node clusters
  • Solid grasp of collective communication patterns, particularly AllReduce and AllToAll
  • Experience with benchmarking (designing, executing, and interpreting performance benchmarks)
  • Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and optimize complex distributed workloads

Other signals

  • AI factories
  • LLM workloads
  • GPU clusters
  • Performance and scalability
  • Observability and automation
  • Benchmarks and validation