Research Engineer - LLM/VLM Inference Optimization (Seed Infra)

ByteDance · Big Tech · San Jose, CA · R&D

A Research Engineer role focused on optimizing LLM/VLM inference systems (inference engines, serving frameworks, and deployment pipelines) through advanced performance techniques, working closely with research teams.

What you'd actually do

  1. Design, develop, and optimize high-performance inference systems for large-scale LLMs and VLMs, covering inference engines, serving frameworks, and end-to-end deployment pipelines.
  2. Build state-of-the-art model inference engines through advanced performance optimization techniques such as compiler-level optimizations, parallel computing, graph fusion, efficient CUDA kernel development, low-precision computation, streaming inference, speculative decoding, and high-concurrency request optimization (an illustrative sketch of one such technique follows this list).
  3. Collaborate closely with other research teams to identify performance bottlenecks, conduct in-depth performance analysis, and optimize large models; contribute to the development of model toolchains and the broader technical ecosystem.
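
Item 2 names speculative decoding among the techniques in scope. Purely as an illustrative aid, and not part of the original posting, here is a minimal greedy speculative-decoding sketch in PyTorch. It assumes HuggingFace-style draft_model and target_model whose forward pass returns an object with a .logits tensor, batch size 1, and a fixed draft length k; all of these names and choices are placeholder assumptions.

    import torch

    @torch.no_grad()
    def speculative_decode_step(draft_model, target_model, input_ids, k=4):
        # One round of greedy speculative decoding (illustrative sketch only).
        # draft_model / target_model are assumed to be HuggingFace-style causal
        # LMs returning .logits of shape [batch, seq_len, vocab]; batch size 1.
        prompt_len = input_ids.shape[1]

        # 1) Small draft model proposes k tokens autoregressively (greedy).
        draft_ids = input_ids
        for _ in range(k):
            next_logits = draft_model(draft_ids).logits[:, -1, :]
            next_id = next_logits.argmax(dim=-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_id], dim=-1)
        proposed = draft_ids[:, prompt_len:]                      # [batch, k]

        # 2) Large target model scores prompt + drafted tokens in one pass.
        target_logits = target_model(draft_ids).logits
        # Target's greedy prediction at each drafted position.
        target_pred = target_logits[:, prompt_len - 1:-1, :].argmax(dim=-1)

        # 3) Accept the longest prefix where draft and target agree.
        agree = (proposed == target_pred)[0].long()
        n_accept = int(agree.cumprod(dim=0).sum().item())

        # 4) Append accepted tokens plus one token chosen by the target model:
        #    its correction at the first disagreement, or its prediction for
        #    the position after all k accepted tokens.
        if n_accept < k:
            bonus = target_pred[:, n_accept:n_accept + 1]
        else:
            bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        return torch.cat([input_ids, proposed[:, :n_accept], bonus], dim=-1)

In production systems this is typically combined with KV caching, batched verification, and probabilistic acceptance rather than the exact greedy match shown here.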

Skills

Required

  • C++
  • Python
  • algorithms
  • data structures
  • systems programming
  • containerization
  • server-side debugging
  • PyTorch
  • TensorFlow
  • LLM/VLM inference deployment
  • latency optimization
  • throughput optimization
  • serving cost optimization
  • GPU architecture
  • compute-intensive operators

Nice to have

  • large-scale LLM serving infrastructure
  • production LLM deployment
  • CUDA
  • OpenCL
  • TensorRT
  • Triton
  • CUTLASS
  • performance modeling
  • profiling
  • CPU/GPU architectures
  • model/data parallelism frameworks
  • distributed inference

What the JD emphasized

  • optimizing LLM/VLM inference at production scale
  • demonstrated impact on latency, throughput, or serving cost
  • large-scale LLM serving infrastructure or equivalent production LLM deployment experience

Other signals

  • optimizing inference systems
  • large-scale LLMs and VLMs
  • production-scale deployment
  • latency, throughput, or serving cost