Senior Solutions Architect, Customer Success

NVIDIA · Semiconductors · Dubai, United Arab Emirates +1 · Remote

This Senior Solutions Architect, Customer Success role at NVIDIA focuses on advising customers on large-scale AI/HPC infrastructure projects, including networking, system design, and automation. It involves assessing infrastructure needs, architecting solutions, and guiding implementation to ensure high availability and efficiency of GPU-accelerated systems.

What you'd actually do

  1. Serve as a senior technical authority and trusted consultant on NVIDIA technologies, contributing to architecture reviews, guiding infrastructure decisions at scale, and providing strategic recommendations aligned with each customer’s business objectives.
  2. Establish and refine monitoring and optimization methodologies using analytics, telemetry, and automation to proactively detect bottlenecks, improve infrastructure resiliency, and drive continuous operational maturity.
  3. Lead and advise on the analysis, optimization, and performance tuning of complex GPU-accelerated systems and AI workloads, ensuring high availability and efficiency across customer data centers.
  4. Facilitate post-deployment reviews, incident retrospectives, and strategy sessions to shape the customer experience and deliver actionable insights into NVIDIA’s infrastructure roadmap.
  5. Own and lead complex technical projects end-to-end—from initial discovery and solution design through implementation, knowledge transfer, and continuous improvement—ensuring alignment to SLAs and proactive mitigation of technical risks.

Skills

Required

  • BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields
  • 10+ years of professional experience in large-scale data center service operations with a focus on infrastructure
  • Demonstrated hands-on experience deploying, configuring, and optimizing NVIDIA GPU-accelerated infrastructure, including driver and firmware management, CUDA toolkit integration, and GPU workload profiling and troubleshooting
  • Track record of building long-term customer relationships and driving adoption through consultative engagement
  • Strong analytical and decision-making capabilities, with a demonstrable ability to identify root causes, drive continuous improvement, and deliver resilient technical solutions
  • Expertise in end-to-end data center architecture, spanning operating systems, Linux kernel drivers, GPU and NIC hardware, high-speed networking (InfiniBand, Ethernet, RDMA), and storage systems (Lustre, GPFS, NFS)
  • Good communication, time management, and organizational skills, with the ability to lead complex multi-functional projects, guide technical teams, and present to executive partners
  • Willingness to travel up to 25% for customer engagements

Nice to have

  • Experience with Kubernetes for container orchestration, resource scheduling, and integration with GPU-accelerated workloads
  • Familiarity with observability stacks (Grafana, Prometheus, Loki) for monitoring, alerting, and building fault-tolerant systems
  • Experience with multi-tenant GPU cluster management and workload scheduling frameworks
  • Experience with NVIDIA Base Command Manager (BCM) for provisioning, managing, and monitoring GPU clusters at scale
  • Background with RDMA-based fabrics (InfiniBand or RoCE) in HPC or AI environments
  • Knowledge of CI/CD pipelines, Infrastructure-as-Code (Terraform, Ansible), and GitOps workflows for infrastructure automation

What the JD emphasized

  • NVIDIA GPU Expertise
  • Customer Engagement
  • Analytical & Problem-Solving Skills
  • System & Infrastructure Proficiency
  • Leadership & Communication