Research Engineer - LLM/VLM Inference Optimization (Seed Infra)

ByteDance · Big Tech · Seattle, WA · R&D

Research Engineer focused on optimizing LLM/VLM inference systems, including inference engines, serving frameworks, and deployment pipelines. Requires expertise in performance optimization techniques, C/C++, Python, ML frameworks, and production-scale LLM inference deployment.

What you'd actually do

  1. Design, develop, and optimize high-performance inference systems for large-scale LLMs and VLMs, covering inference engines, serving frameworks, and end-to-end deployment pipelines.
  2. Build state-of-the-art model inference engines through advanced performance optimization techniques such as compiler-level optimizations, parallel computing, graph fusion, efficient CUDA kernel development, low-precision computation, streaming inference, speculative decoding, and high-concurrency request optimization (a minimal sketch of one such technique, speculative decoding, follows this list).
  3. Collaborate closely with other research teams to identify performance bottlenecks, conduct in-depth performance analysis, and optimize large models; contribute to the development of model toolchains and the broader technical ecosystem.
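To make one of the techniques in item 2 concrete, here is a minimal sketch of greedy speculative decoding. Everything in it is a hypothetical illustration, not code from the posting: ToyLM is a stand-in for a real causal LM, and speculative_decode with its total_len/k parameters is an invented helper. A production engine would pair a real draft/target model, reuse KV caches, and batch concurrent requests.

```python
# Hedged sketch of greedy speculative decoding with toy models (illustration only).
import torch
import torch.nn as nn

VOCAB, DIM = 100, 32

class ToyLM(nn.Module):
    """Hypothetical stand-in for a causal LM: embeds tokens, predicts next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):                  # ids: (seq,)
        return self.head(self.embed(ids))    # logits: (seq, vocab)

@torch.no_grad()
def speculative_decode(target, draft, ids, total_len=16, k=4):
    """Greedy speculative decoding: a cheap draft model proposes k tokens,
    the expensive target model verifies them in one batched forward pass,
    and the longest agreeing prefix is kept plus one token from the target."""
    while ids.numel() < total_len:
        # 1) Draft proposes k tokens autoregressively (cheap, sequential).
        draft_ids = ids
        for _ in range(k):
            nxt = draft(draft_ids)[-1].argmax()
            draft_ids = torch.cat([draft_ids, nxt.view(1)])
        # 2) Target scores every prefix of the proposal in a single pass.
        t_next = target(draft_ids).argmax(-1)   # target's greedy token after each prefix
        proposed = draft_ids[ids.numel():]      # the k drafted tokens
        verify = t_next[ids.numel() - 1:-1]     # target's choice at those same positions
        # 3) Accept the longest prefix where draft and target agree, then append
        #    one target token: its correction at the first mismatch, or a free
        #    bonus token when all k drafted tokens were accepted.
        n_ok = int((proposed == verify).int().cumprod(0).sum())
        ids = torch.cat([ids, proposed[:n_ok], t_next[ids.numel() - 1 + n_ok].view(1)])
    return ids

if __name__ == "__main__":
    torch.manual_seed(0)
    target, draft = ToyLM().eval(), ToyLM().eval()
    print(speculative_decode(target, draft, torch.tensor([1, 2, 3])).tolist())
```

The speedup comes from step 2: verifying k drafted tokens costs one target forward pass instead of k sequential ones, which is exactly the latency/throughput trade-off this role targets.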

Skills

Required

  • C/C++
  • Python
  • algorithms
  • data structures
  • systems programming
  • containerization
  • server-side debugging
  • PyTorch
  • TensorFlow
  • LLM/VLM inference deployment
  • GPU architecture
  • compute-intensive operator optimization

Nice to have

  • large-scale LLM serving infrastructure
  • production LLM deployment
  • GPU programming (CUDA/OpenCL)
  • TensorRT
  • Triton
  • CUTLASS
  • performance modeling
  • profiling
  • CPU/GPU architectures
  • model/data parallelism frameworks for distributed inference

What the JD emphasized

  • Experience deploying or optimizing LLM/VLM inference at production scale, with demonstrated impact on latency, throughput, or serving cost.

Other signals

  • LLM/VLM inference optimization
  • production scale deployment
  • performance optimization techniques