Manager, Software Engineering - Production AI Inference

NVIDIA · Semiconductors · Santa Clara, CA

Manager for a software engineering team focused on production AI inference for NVIDIA Inference Microservices (NIM). The role involves leading the team responsible for shipping production-ready LLM NIMs, including model onboarding, serving stack integration, performance optimization, release quality, and operational health. Requires strong experience in production software, AI/ML fundamentals, and managing engineering teams.

What you'd actually do

  1. Lead the team responsible for shipping production-ready LLM NIMs, including planning, new model onboarding, validated serving recipes, release readiness, and post-release follow-through.
  2. Build a predictable operating model for the team through roadmap planning, a weekly execution rhythm, launch checklists, clear ownership boundaries, collaborator communication, and issue management.
  3. Own project execution by anticipating schedule, staffing, and dependency risks. Adapt plans under pressure and collaborate with peer managers to reprioritize engineering timelines as needed, staying agile in the fast-paced AI industry.
  4. Drive continuous improvement in production workflows through root cause and corrective action (RCCA) analysis and partner feedback, removing unnecessary and redundant work while keeping the team passionate about production outcomes.
  5. Build and maintain a world-class AI inference engineering team by fostering an innovative culture, setting clear expectations, maintaining active feedback loops, and mentoring engineers and emerging leaders.

Skills

Required

  • 10+ years overall building production software
  • 3+ years managing software engineering teams
  • Experience delivering production software with strong quality, reliability, and release expectations
  • Experience driving process improvements and improving operational efficiency
  • Excellent communication and collaborator management; ability to influence executive leadership across product, research, security, and operations
  • Deep understanding of AI/ML fundamentals, innovative model architectures, inference engine/kernel, performance optimization strategies, accelerated computing, large-scale distributed systems, and security hardening
  • A degree in Computer Science, Computer Engineering, or a related field (BS or MS) or equivalent experience

Nice to have

  • Built and managed globally distributed organizations
  • Established durable engineering processes that significantly improved quality and velocity across multiple teams
  • Recognized industry leader with contributions to open-source ecosystems (e.g., vLLM, SGLang, TensorRT-LLM, Dynamo, Triton, PyTorch), technical publications, or talks in containers, Kubernetes, GPU, or inference communities
  • Drove measurable performance improvements for large-scale LLM inference systems, including latency, throughput, GPU utilization, cost efficiency, and performance regression prevention across production releases
  • Hands-on experience with core GPU technologies such as CUDA, cuDNN, CUTLASS, cuBLAS, NCCL, NIXL, NVLink, and GPUDirect RDMA
  • Hands-on experience delivering enterprise or government-ready AI software, including FedRAMP, air-gapped deployments, regulated environments, security hardening, compliance evidence, and production support expectations

What the JD emphasized

  • production AI inference
  • production software
  • AI/ML fundamentals
  • inference engine/kernel
  • performance optimization strategies
  • large-scale distributed systems
  • security hardening
  • measurable performance improvements for large-scale LLM inference systems
  • latency
  • throughput
  • GPU utilization
  • cost efficiency
  • performance regression prevention
  • enterprise or government-ready AI software
  • FedRAMP
  • air-gapped deployments
  • regulated environments
  • compliance evidence
  • production support expectations

Other signals

  • production AI inference
  • NVIDIA Inference Microservices (NIM)
  • enterprise-supported AI inference
  • optimized inference engines
  • model profiles/recipes
  • validated runtime configurations
  • security hardening
  • shipping production-ready LLM NIMs
  • serving stack integration
  • performance profiling/optimization
  • release quality
  • security readiness
  • automation
  • observability
  • operational health
  • making day-0 model launches repeatable
  • raise the production bar for every NIM release