Software Engineer - Model Performance

Baseten · Data AI · San Francisco, CA · EPD

Software Engineer focused on ML performance for LLM inference: applying optimization techniques such as quantization and speculative decoding, and debugging performance issues in libraries such as TensorRT and PyTorch.

What you'd actually do

  1. Implement, refine, and productionize cutting-edge techniques (quantization, speculative decoding, KV cache reuse, chunked prefill, and LoRA) for ML model inference and infrastructure.
  2. Deep-dive into the underlying codebases of TensorRT, PyTorch, TensorRT-LLM, vLLM, SGLang, CUDA, and other libraries to debug ML performance issues.
  3. Apply and scale optimization techniques across a wide range of ML models, particularly large language models.
  4. Collaborate with a diverse team to design and implement innovative solutions.
  5. Own projects from idea to production.
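To make the techniques above concrete, here is a toy sketch of greedy speculative decoding: a cheap "draft" model proposes a few tokens ahead, and the expensive "target" model verifies them, keeping the longest agreeing prefix. The `draft_next` and `target_next` functions are hypothetical stand-ins (simple arithmetic over integer token IDs), not any real model API; a production implementation lives in libraries like TensorRT-LLM or vLLM.

```python
def draft_next(context):
    # Toy draft model: predicts (last token + 1) mod 10.
    return (context[-1] + 1) % 10

def target_next(context):
    # Toy target model: agrees with the draft except after token 6,
    # where it predicts 0 instead, forcing a rejection.
    return 0 if context[-1] == 6 else (context[-1] + 1) % 10

def speculative_decode(prompt, num_tokens, k=4):
    """Generate num_tokens tokens, proposing k draft tokens per step."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target model verifies the proposals; accept until the first
        # mismatch, then substitute the target's own token and restart.
        ctx = list(out)
        for t in proposal:
            expected = target_next(ctx)
            if expected == t:
                out.append(t)
                ctx.append(t)
            else:
                out.append(expected)
                break
    return out[: len(prompt) + num_tokens]
```

When draft and target agree, each verification step commits up to `k` tokens at once, which is the source of the speedup; a mismatch costs one corrected token, exactly as in the real technique.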

Skills

Required

  • Python
  • C++
  • LLM optimization techniques
  • quantization
  • speculative decoding
  • continuous batching
  • PyTorch
  • TensorRT
  • TensorRT-LLM
  • GPU architecture
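As a minimal illustration of the quantization skill listed above, here is a pure-Python sketch of symmetric per-tensor INT8 quantization, the kind of transform applied to model weights before inference. The function names are hypothetical; real pipelines use library kernels (e.g. in TensorRT or PyTorch) rather than Python loops.

```python
def quantize_int8(weights):
    # Symmetric per-tensor scheme: one scale maps the largest
    # magnitude in the tensor to the INT8 limit (127).
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate FP values; error is bounded by scale / 2.
    return [v * scale for v in q]
```

Storing `q` instead of the original floats cuts weight memory (and memory bandwidth at inference time) by 4x versus FP32, at the cost of the rounding error visible after `dequantize`.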

Nice to have

  • enhancing the performance of software systems
  • large language models (LLMs)
  • CUDA
  • software engineering principles
  • developing and deploying AI/ML inference solutions
  • Docker
  • Kubernetes

What the JD emphasized

  • ML model inference
  • LLM Inference
  • LLM optimization techniques
  • LLMs
  • AI/ML inference solutions

Other signals

  • ML performance
  • optimization techniques
  • GPU architecture