Lead Principal Engineer, Enterprise Agentic AI Platform

NVIDIA · Semiconductors · Santa Clara, CA

Lead Principal Engineer for NVIDIA's Enterprise Agentic AI Platform, focused on building and scaling production-grade agentic AI systems, including multi-agent orchestration, memory systems, and evaluation pipelines. The role requires deep expertise in distributed systems, Kubernetes, and GPU inference, plus hands-on coding in Python and/or Go.

What you'd actually do

  1. Develop and deliver production-quality agentic AI systems from start to finish using Python and/or Go, covering Kubernetes deployment, agent runtimes, memory systems, orchestration, tool integration, and evaluation pipelines.
  2. Define and advance NVIDIA’s Enterprise Agentic AI architecture through practical implementations, reference systems, and production deployments—not abstract diagrams.
  3. Build and implement multi-agent orchestration patterns (planner, executor, reviewer, tool agents) using frameworks such as LangChain, LangGraph, or similar orchestration systems, with strong regression coverage and observability.
  4. Run fast, high-quality POCs on emerging agent architectures; harden successful patterns into reusable platform services, APIs, SDKs, and developer templates.
  5. Architect and implement data flywheels that continuously improve agent quality through telemetry, benchmarking, automated evaluation, and structured feedback loops.

Skills

Required

  • Python
  • Go
  • Kubernetes
  • distributed systems
  • agentic AI systems
  • RAG pipelines
  • multi-agent management
  • LangChain
  • LangGraph
  • evaluation infrastructure
  • containerized workloads
  • networking
  • APIs
  • enterprise integration patterns
  • benchmarking
  • regression testing
  • telemetry
  • observability systems
  • performance tuning
  • GPU-based inference systems

Nice to have

  • Master’s or PhD
  • Cursor
  • Claude Code
  • Claude Cowork
  • developer-acceleration components
  • SDKs
  • APIs
  • templates
  • reference implementations
  • CI/CD automation
  • enterprise vector databases
  • retrieval systems
  • Glean
  • Microsoft Copilot Studio
  • Google Agentspace
  • fine-grained policy enforcement
  • access controls
  • sandbox isolation
  • audit trails
  • GPU acceleration
  • model inference optimization
  • batching strategies
  • memory utilization
  • efficiency on NVIDIA hardware
  • open-source contributions

What the JD emphasized

  • deeply involved technical leader writing code daily
  • grasp infrastructure aspects from Kubernetes to GPU inference stacks
  • define enterprise-grade agentic AI at NVIDIA scale
  • rapidly move from concepts to operational systems
  • invent systems that incorporate persistent memory, controlled runtime environments, strict assessment, and GPU-powered performance
  • Proven skill in quickly transitioning from an idea to a functional prototype and then to a robust, scalable platform solution.
  • Proven track record in constructing agentic AI systems, including RAG pipelines, long-lasting memory models, multi-agent management (e.g., LangChain, LangGraph), tool frameworks, and evaluation infrastructure.
  • Expert-level depth in Kubernetes, containerized workloads, networking, APIs, and secure enterprise integration patterns.
  • Comprehensive knowledge of performance tuning in hybrid environments, including GPU-based inference systems.

Other signals

  • Develop and deliver production-quality agentic AI systems from start to finish
  • Define and advance NVIDIA’s Enterprise Agentic AI architecture
  • Build and implement multi-agent orchestration patterns
  • Architect and implement data flywheels that continuously improve agent quality