Research Engineer - LLM/VLM Inference Optimization (Seed Infra)

ByteDance · Big Tech · Seattle, WA · R&D

Research Engineer focused on optimizing LLM/VLM inference systems, including inference engines, serving frameworks, and deployment pipelines. Requires expertise in performance optimization techniques, C/C++, Python, ML frameworks, and production-scale LLM inference deployment.

What you'd actually do

  1. Design, develop, and optimize high-performance inference systems for large-scale LLMs and VLMs, covering inference engines, serving frameworks, and end-to-end deployment pipelines.
  2. Build state-of-the-art model inference engines through advanced performance optimization techniques such as compiler-level optimizations, parallel computing, graph fusion, efficient CUDA kernel development, low-precision computation, streaming inference, speculative decoding, and high-concurrency request optimization (a minimal sketch of one such technique, speculative decoding, follows this list).
  3. Collaborate closely with other research teams to identify performance bottlenecks, conduct in-depth performance analysis, and optimize large models; contribute to the development of model toolchains and the broader technical ecosystem.
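To make one of the techniques in item 2 concrete, here is a minimal sketch of greedy speculative decoding. Everything in it is a hypothetical illustration, not code from the posting: ToyLM is a stand-in for a real causal LM, and speculative_decode with its total_len/k parameters is an invented helper. A production engine would pair a real draft/target model, reuse KV caches, and batch concurrent requests.

```python
# Hedged sketch of greedy speculative decoding with toy models (illustration only).
import torch
import torch.nn as nn

VOCAB, DIM = 100, 32

class ToyLM(nn.Module):
    """Hypothetical stand-in for a causal LM: embeds tokens, predicts next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):                  # ids: (seq,)
        return self.head(self.embed(ids))    # logits: (seq, vocab)

@torch.no_grad()
def speculative_decode(target, draft, ids, total_len=16, k=4):
    """Greedy speculative decoding: a cheap draft model proposes k tokens,
    the expensive target model verifies them in one batched forward pass,
    and the longest agreeing prefix is kept plus one token from the target."""
    while ids.numel() < total_len:
        # 1) Draft proposes k tokens autoregressively (cheap, sequential).
        draft_ids = ids
        for _ in range(k):
            nxt = draft(draft_ids)[-1].argmax()
            draft_ids = torch.cat([draft_ids, nxt.view(1)])
        # 2) Target scores every prefix of the proposal in a single pass.
        t_next = target(draft_ids).argmax(-1)   # target's greedy token after each prefix
        proposed = draft_ids[ids.numel():]      # the k drafted tokens
        verify = t_next[ids.numel() - 1:-1]     # target's choice at those same positions
        # 3) Accept the longest prefix where draft and target agree, then append
        #    one target token: its correction at the first mismatch, or a free
        #    bonus token when all k drafted tokens were accepted.
        n_ok = int((proposed == verify).int().cumprod(0).sum())
        ids = torch.cat([ids, proposed[:n_ok], t_next[ids.numel() - 1 + n_ok].view(1)])
    return ids

if __name__ == "__main__":
    torch.manual_seed(0)
    target, draft = ToyLM().eval(), ToyLM().eval()
    print(speculative_decode(target, draft, torch.tensor([1, 2, 3])).tolist())
```

The speedup comes from step 2: verifying k drafted tokens costs one target forward pass instead of k sequential ones, which is exactly the latency/throughput trade-off this role targets.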

Skills

Required

  • C/C++
  • Python
  • algorithms
  • data structures
  • systems programming
  • containerization
  • server-side debugging
  • PyTorch
  • TensorFlow
  • LLM/VLM inference deployment
  • GPU architecture
  • compute-intensive operator optimization

Nice to have

  • large-scale LLM serving infrastructure
  • production LLM deployment
  • GPU programming (CUDA/OpenCL)
  • TensorRT
  • Triton
  • CUTLASS
  • performance modeling
  • profiling
  • CPU/GPU architectures
  • model/data parallelism frameworks for distributed inference

What the JD emphasized

  • Experience deploying or optimizing LLM/VLM inference at production scale, with demonstrated impact on latency, throughput, or serving cost.

Other signals

  • LLM/VLM inference optimization
  • production scale deployment
  • performance optimization techniques