Inference Optimization Architect, Speech AI

NVIDIA · Semiconductors · Bangalore, India +2

NVIDIA is seeking an Inference Optimization Architect to accelerate and scale Speech AI models, focusing on reducing inference latency, improving throughput, and optimizing resource utilization across AI infrastructure. The role involves implementing model compression techniques, developing custom kernels, designing serving infrastructure, and optimizing inference across diverse GPU platforms.

What you'd actually do

  1. Optimize Inference Performance: Improve streaming latency and throughput through advanced batching strategies, encoder caching, and multi-threaded pipeline optimizations.
  2. Model Compression: Implement techniques including quantization, pruning, and knowledge distillation.
  3. Benchmarking: Profile and benchmark models to identify and resolve performance bottlenecks, including GPU profiling and debugging with Nsight Systems and Nsight Compute.
  4. Hardware Acceleration: Develop custom kernels and leverage hardware acceleration (CUDA, TensorRT, etc.).
  5. Infrastructure Design: Design and implement efficient serving infrastructure for Speech models at scale.

Skills

Required

  • Master's or BE/BTech in Computer Science, Computer Architecture, or a related field
  • 10+ years of total experience and 5+ years in performance optimization of deep learning model inference
  • Experience with inference pipelines for LLMs, speech recognition, and speech synthesis
  • CUDA kernel development: thread blocks, shared memory, synchronization
  • Model inference optimization: batching, dynamic shapes, latency tuning
  • Model serving and deployment: Triton, TorchServe, TensorRT, TRT-LLM, vLLM
  • Model optimization techniques: quantization, pruning, distillation
  • Computer architecture and operating systems: processes, threads, scheduling, memory management
  • Solid understanding of modern model architectures (Transformers, CNNs, RNNs)

Nice to have

  • Publications or contributions to open-source projects such as PyTorch, JAX, or triton-lang
  • Experience with embedded systems or edge deployment
  • Strong collaborative and interpersonal skills, specifically a proven ability to guide and influence others within a dynamic matrix environment

What the JD emphasized

  • 10+ years of total experience and 5+ years in performance optimization of deep learning model inference
  • Experience with inference pipelines for LLMs, speech recognition, and speech synthesis
  • CUDA kernel development: thread blocks, shared memory, synchronization
  • Model inference optimization: batching, dynamic shapes, latency tuning
  • Model serving and deployment: Triton, TorchServe, TensorRT, TRT-LLM, vLLM
  • Model optimization techniques: quantization, pruning, distillation
  • Solid understanding of modern model architectures (Transformers, CNNs, RNNs)

Other signals

  • Optimize Inference Performance
  • Model Compression
  • Hardware Acceleration
  • Infrastructure Design