Solutions Architect, Inference Deployments

NVIDIA · Semiconductors · Santa Clara, CA

This role focuses on building and deploying AI inference solutions at scale using NVIDIA's GPU technology and Kubernetes. The Solutions Architect will collaborate with engineering, DevOps, and customers to optimize and serve generative AI models, ensuring low-latency inference in enterprise environments.

What you'd actually do

  1. Build inference pipelines with tools like NVIDIA Dynamo, distributing tasks among GPU workers to improve efficiency.
  2. Collaborate with DevOps teams to orchestrate disaggregated inference using Kubernetes for complex workloads.
  3. Accelerate inference pipelines using TensorRT-LLM, vLLM, SGLang, and other backends, and ensure they integrate cleanly with disaggregated inference.
  4. Provide mentorship and technical leadership to customers and internal teams, guiding them through the deployment of disaggregated inference systems and resolving complex issues.

Skills

Required

  • Solutions Architecture
  • deploying distributed systems
  • AI inference workloads on Kubernetes
  • NVIDIA Dynamo
  • Triton Inference Server
  • TensorRT-LLM
  • model optimization
  • model serving
  • GPU orchestration
  • NVIDIA GPU Operator
  • NIM Operator
  • Multi-Instance GPU (MIG) partitioning
  • GPU allocation
  • memory hierarchies
  • low-latency networking
  • tuning large language models
  • low-latency inference
  • enterprise environments
  • BS in CS/Engineering or equivalent experience

Nice to have

  • NVIDIA inference technologies (Dynamo, NIM, NIXL, Grove)
  • transformer neural networks
  • quantization
  • speculative decoding
  • WideEP
  • NVIDIA Certified AI Engineer
  • open-source contributions (NVIDIA Dynamo, vLLM, KServe, SGLang)

What the JD emphasized

  • deploying distributed systems and AI inference workloads on Kubernetes
  • low-latency inference

Other signals

  • deploying AI inference solutions at scale
  • delivering generative AI to production
  • accelerate inference pipelines