Senior Solutions Architect, Cloud Infrastructure and DevOps

NVIDIA · Semiconductors · Dubai, United Arab Emirates +1 · Remote

NVIDIA is seeking a Senior Cloud Infrastructure and DevOps Solutions Architect to advise on and guide the implementation of large-scale computational and AI infrastructure, focusing on Kubernetes-based platforms and automation for AI/HPC systems.

What you'd actually do

  1. Advise on and help maintain large-scale computational and AI infrastructure, including monitoring, logging, and workload orchestration (Kubernetes and Linux job schedulers).
  2. Provide consultative guidance and hands-on troubleshooting across the full stack, from bare metal and operating system through the software stack, container platform, networking, and storage.
  3. Assess customer environments and recommend optimized, production-ready Kubernetes-based container platforms integrated with enterprise-grade networking and storage solutions.
  4. Serve as a key technical resource: develop, refine, and document standard methodologies and operational guidelines to be shared with internal teams and customer partners.
  5. Support Research & Development activities and engage in POCs/POVs to validate new features, architectures, and upgrade approaches.

Skills

Required

  • BS/MS/PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields (or equivalent experience)
  • 8+ years of professional experience managing scalable cloud environments and working in automation engineering roles
  • Demonstrated understanding of networking fundamentals and data center architectures
  • Hands-on experience managing HPC/AI clusters, including deployment, optimization, and troubleshooting
  • Validated hands-on experience deploying, configuring, and optimizing NVIDIA GPU-accelerated infrastructure, including driver management, CUDA toolkit integration, and GPU workload profiling
  • Extensive experience with Kubernetes for container orchestration, resource scheduling, scaling, and integration with GPU-accelerated and HPC environments
  • Strong familiarity with HPC and AI technologies (CPUs, GPUs, high-speed interconnects) and supporting software stacks
  • Deep knowledge of Linux (RedHat, Ubuntu), OS-level security, and protocols
  • Experience with storage solutions such as Lustre, GPFS, ZFS, XFS, and emerging Kubernetes storage technologies
  • Proficiency in Python and Bash scripting, configuration management, and Infrastructure-as-Code tools (e.g., Ansible, Terraform)
  • Experience with observability stacks (Grafana, Loki, Prometheus) for monitoring, logging, and building fault-tolerant systems
  • Strong background in crafting scalable solutions and providing consultative support to customers, including leading architectural reviews and speaking publicly to executive partners

Nice to have

  • Knowledge of CI/CD pipelines for software deployment and automation
  • Experience working with NVIDIA GPU and Network Operators to manage automated resource lifecycle in Kubernetes environments
  • Solid hands-on knowledge of Kubernetes and container-based microservices architectures
  • Experience with NVIDIA Base Command Manager (BCM) for provisioning, managing, and supervising GPU clusters at scale
  • Background with RDMA-based fabrics (InfiniBand or RoCE) in HPC or AI environments

What the JD emphasized

  • Hands-on experience managing HPC/AI clusters
  • Extensive experience with Kubernetes for container orchestration
  • Validated hands-on experience deploying, configuring, and optimizing NVIDIA GPU-accelerated infrastructure

Other signals

  • AI/HPC systems
  • Kubernetes
  • GPU-accelerated infrastructure