Research Engineer - LLM/VLM Inference Optimization (Seed Infra)

ByteDance · Big Tech · San Jose, CA · R&D

A Research Engineer role focused on optimizing LLM/VLM inference systems (inference engines, serving frameworks, and deployment pipelines) through advanced performance techniques, working closely with research teams.

What you'd actually do

  1. Design, develop, and optimize high-performance inference systems for large-scale LLMs and VLMs, covering inference engines, serving frameworks, and end-to-end deployment pipelines.
  2. Build state-of-the-art model inference engines through advanced performance optimization techniques such as compiler-level optimizations, parallel computing, graph fusion, efficient CUDA kernel development, low-precision computation, streaming inference, speculative decoding, and high-concurrency request optimization (an illustrative sketch of one such technique follows this list).
  3. Collaborate closely with other research teams to identify performance bottlenecks, conduct in-depth performance analysis, and optimize large models; contribute to the development of model toolchains and the broader technical ecosystem.
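
Item 2 names speculative decoding among the techniques in scope. Purely as an illustrative aid, and not part of the original posting, here is a minimal greedy speculative-decoding sketch in PyTorch. It assumes HuggingFace-style draft_model and target_model whose forward pass returns an object with a .logits tensor, batch size 1, and a fixed draft length k; all of these names and choices are placeholder assumptions.

    import torch

    @torch.no_grad()
    def speculative_decode_step(draft_model, target_model, input_ids, k=4):
        # One round of greedy speculative decoding (illustrative sketch only).
        # draft_model / target_model are assumed to be HuggingFace-style causal
        # LMs returning .logits of shape [batch, seq_len, vocab]; batch size 1.
        prompt_len = input_ids.shape[1]

        # 1) Small draft model proposes k tokens autoregressively (greedy).
        draft_ids = input_ids
        for _ in range(k):
            next_logits = draft_model(draft_ids).logits[:, -1, :]
            next_id = next_logits.argmax(dim=-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_id], dim=-1)
        proposed = draft_ids[:, prompt_len:]                      # [batch, k]

        # 2) Large target model scores prompt + drafted tokens in one pass.
        target_logits = target_model(draft_ids).logits
        # Target's greedy prediction at each drafted position.
        target_pred = target_logits[:, prompt_len - 1:-1, :].argmax(dim=-1)

        # 3) Accept the longest prefix where draft and target agree.
        agree = (proposed == target_pred)[0].long()
        n_accept = int(agree.cumprod(dim=0).sum().item())

        # 4) Append accepted tokens plus one token chosen by the target model:
        #    its correction at the first disagreement, or its prediction for
        #    the position after all k accepted tokens.
        if n_accept < k:
            bonus = target_pred[:, n_accept:n_accept + 1]
        else:
            bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        return torch.cat([input_ids, proposed[:, :n_accept], bonus], dim=-1)

In production systems this is typically combined with KV caching, batched verification, and probabilistic acceptance rather than the exact greedy match shown here.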

Skills

Required

  • C++
  • Python
  • algorithms
  • data structures
  • systems programming
  • containerization
  • server-side debugging
  • PyTorch
  • TensorFlow
  • LLM/VLM inference deployment
  • latency optimization
  • throughput optimization
  • serving cost optimization
  • GPU architecture
  • compute-intensive operators

Nice to have

  • large-scale LLM serving infrastructure
  • production LLM deployment
  • CUDA
  • OpenCL
  • TensorRT
  • Triton
  • CUTLASS
  • performance modeling
  • profiling
  • CPU/GPU architectures
  • model/data parallelism frameworks
  • distributed inference

What the JD emphasized

  • optimizing LLM/VLM inference at production scale
  • demonstrated impact on latency, throughput, or serving cost
  • large-scale LLM serving infrastructure or equivalent production LLM deployment experience

Other signals

  • optimizing inference systems
  • large-scale LLMs and VLMs
  • production-scale deployment
  • latency, throughput, or serving cost