LLM Inference Frameworks and Optimization Engineer

Together AI · Data AI · Remote · Research

Together AI is seeking an Inference Frameworks and Optimization Engineer to design, develop, and optimize distributed inference engines for multimodal and language models. The role focuses on low-latency, high-throughput inference, GPU/accelerator optimizations, and software-hardware co-design for efficient large-scale AI deployment.

What you'd actually do

  1. Design and develop a fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models.
  2. Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, and pipeline parallelism, for high-performance serving.
  3. Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability (a brief sketch follows this list).
  4. Collaborate with hardware teams on performance bottleneck analysis and co-optimize inference performance for GPUs, TPUs, or custom accelerators.
  5. Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize end-to-end (E2E) model serving pipelines.
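
For context on item 3, here is a minimal sketch of CUDA graph capture and torch.compile-based overhead reduction, assuming a CUDA-capable GPU and PyTorch 2.x. The Linear "decode step", tensor shapes, and variable names are illustrative placeholders, not anything specified by the role.

```python
import torch

# Minimal sketch (assumes a CUDA GPU and PyTorch 2.x); the Linear layer stands
# in for one decode step of a real model.
step = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)

# Option (a): torch.compile's "reduce-overhead" mode applies CUDA graphs
# automatically behind the scenes.
compiled_step = torch.compile(step, mode="reduce-overhead")

# Option (b): manual capture -- warm up, record the kernels once, then replay
# them on every token using fixed-address input/output buffers.
static_in = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
for _ in range(3):                # warm-up so lazy initialization is not captured
    step(static_in)
torch.cuda.synchronize()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = step(static_in)  # recorded into the graph

# Serving loop: copy fresh activations into the static buffer and replay.
static_in.copy_(torch.randn_like(static_in))
graph.replay()                    # relaunches the captured kernels without per-kernel launch overhead
print(static_out.shape)
```

Replaying a captured graph avoids re-paying kernel launch and Python dispatch costs on every decode step, which is where much of the latency goes for small per-token workloads.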

Skills

Required

  • Deep learning inference frameworks
  • Distributed systems
  • High-performance computing
  • LLM inference frameworks (TensorRT-LLM, vLLM, SGLang, TGI)
  • GPU programming (CUDA, Triton, TensorRT)
  • Compiler optimization
  • Model quantization
  • GPU cluster scheduling
  • Python
  • C++
  • CUDA
  • Transformer architectures
  • LLM/VLM/Diffusion model optimization
  • Workload scheduling
  • CUDA graph
  • Speculative decoding

Nice to have

  • Software systems for large-scale data center networks with RDMA/RoCE
  • Distributed filesystems (3FS, HDFS, Ceph)
  • Kubernetes (K8S)
  • Open-source deep learning inference projects

What the JD emphasized

  • 3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing
  • Familiar with at least one LLM inference framework (e.g., TensorRT-LLM, vLLM, SGLang, TGI (Text Generation Inference))
  • Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compilers, model quantization, or GPU cluster scheduling
  • Proficient in Python and C++/CUDA for high-performance deep learning inference
  • Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization
  • Knowledge of inference optimization techniques such as workload scheduling, CUDA graphs, compilation, and efficient kernels

Other signals

  • optimize inference frameworks
  • scalable inference
  • low-latency, high-throughput inference
  • GPU/accelerator optimizations
  • software-hardware co-design
  • efficient large-scale deployment