Senior Solutions Architect, CSP System

NVIDIA · Semiconductors · Shenzhen, China +1

Senior Solutions Architect focused on building and optimizing Kubernetes infrastructure for Agentic AI and Agentic RL workloads, working with Cloud Service Providers in China.

What you'd actually do

  1. Lead the design, development, and optimization of Kubernetes-based infrastructure solutions for Agentic AI and Agentic RL workloads, addressing core challenges including massively concurrent sandbox scheduling, millisecond-level elasticity, secure isolation, and support for interactive environments across all scenarios.
  2. Collaborate closely with NVIDIA’s CSP partners (major cloud service providers in China) to understand their Agentic AI/RL business needs, provide professional K8s technical guidance, and tailor infrastructure solutions that align with NVIDIA’s accelerated computing technologies (such as NVIDIA AI Enterprise, GB200 platform, and NVCF).
  3. Optimize Kubernetes clusters to support high-throughput, low-latency Agentic RL training and inference workloads, including resource scheduling strategy optimization, GPU resource management, network and storage performance tuning, and solving bottlenecks in large-scale Pod creation and scheduling.
  4. Design and implement core Agent Infra components on K8s, such as secure sandbox environments, interactive trajectory recording, checkpoint and breakpoint replay, and end-to-end observability tools, to support the full lifecycle of Agentic AI/RL development and deployment.
  5. Work with cross-functional teams (NVIDIA’s R&D, solution architecture, and technical support teams) to promote the integration of K8s with NVIDIA’s software and hardware ecosystem, including NVIDIA Operators, Dynamo, Grove, and KAI Scheduler, to achieve optimal performance of Agentic workloads.

Skills

Required

  • 10+ years of hands-on experience in Kubernetes development, operation, and optimization, with deep expertise in K8s core components (kube-apiserver, etcd, kube-scheduler, kubelet) and custom resource development (CRD/Operator).
  • Proven experience in building and optimizing infrastructure for AI/ML workloads, with an in-depth understanding of Agentic AI and Agentic RL concepts. Practical experience supporting Agentic RL training or inference workloads on K8s is a strong plus.
  • Proficiency in containerization technologies (Docker, containerd), container network solutions (Calico, Cilium), and storage solutions (Ceph, GlusterFS), with experience in optimizing network and storage performance for high-concurrency AI workloads.
  • Strong experience in GPU resource management on K8s, familiarity with NVIDIA GPU Operator, CUDA, and accelerated computing technologies, and the ability to optimize GPU utilization for Agentic AI/RL workloads.
  • Excellent programming skills and proficiency in at least one programming language (Python, Go, C++), with the ability to develop custom K8s controllers, plugins, or automation tools.
  • Deep understanding of cloud-native architecture and best practices; experience working with major CSPs (Alibaba Cloud, Tencent Cloud, Huawei Cloud, etc.) is highly preferred.
  • Fluent in spoken and written English, able to communicate effectively with global cross-functional teams and read technical documentation in English.
  • Strong problem-solving skills, the ability to identify and resolve complex K8s and Agentic AI/RL Infra technical issues independently, and a proactive, results-driven work attitude.

Nice to have

  • Master’s degree is preferred.
  • Experience in building Agentic AI/RL sandbox environments, and familiarity with sandbox technologies and their integration with K8s.
  • Experience in large-scale data center infrastructure management, with an understanding of the challenges of scheduling bursty (pulse-type) workloads and optimizing cost in Agentic RL scenarios.
  • Familiarity with Agentic AI frameworks and RL frameworks, and the ability to align K8s infrastructure with framework requirements.
  • Relevant certifications such as CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), or CKS (Certified Kubernetes Security Specialist).

What the JD emphasized

  • Agentic AI
  • Agentic RL
  • Kubernetes
  • infrastructure

Other signals

  • Agentic AI/RL workloads
  • Kubernetes infrastructure
  • CSP partners