ML Research Engineer (inference)

Cerebras · Semiconductors · India · Software

Research Engineer focused on optimizing advanced language and vision models for efficient inference on Cerebras' AI hardware. The role involves adapting models to that hardware, optimizing inference performance (latency, throughput), and advancing techniques such as speculative decoding, pruning, and sparse attention.

What you'd actually do

  1. Implement and adapt transformer-based models (NLP and/or vision) to run on Cerebras hardware
  2. Assist in optimizing models for inference performance (latency, throughput)
  3. Run experiments, analyze results, and support model improvements
  4. Help bring up and validate models on the Cerebras system
  5. Debug and troubleshoot model or system issues with guidance from senior team members

Skills

Required

  • Python
  • PyTorch, Transformers, vLLM or SGLang
  • deep learning concepts
  • Generative AI and Machine Learning systems
  • C++

Nice to have

  • speculative decoding
  • neural network pruning and compression
  • sparse attention
  • quantization
  • sparsity
  • post-training techniques
  • inference-focused evaluations
  • large language models
  • computer vision models
  • running experiments or tuning models
  • Hugging Face Transformers
  • performance concepts (latency, throughput)
  • Linux environments

What the JD emphasized

  • speculative decoding
  • large-model pruning and compression
  • sparse attention
  • sparsity-driven techniques
  • low-latency, high-throughput inference at scale
  • inference performance
  • post-training techniques
  • inference-focused evaluations

Other signals

  • Optimizing models for inference performance
  • Pushing the frontier of speculative decoding, large-model pruning and compression, sparse attention, and sparsity-driven techniques
  • Adapting advanced language and vision models to run efficiently on Cerebras hardware