ML Research Engineer (inference)

Cerebras · Semiconductors · India · Software

Research Engineer role focused on adapting and optimizing advanced language and vision models for efficient inference on Cerebras' wafer-scale AI architecture. The work involves implementing, validating, and tuning models for low-latency, high-throughput inference, using techniques such as speculative decoding, pruning, compression, and sparsity.
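
Of the techniques above, speculative decoding is the most inference-specific, so here is a minimal sketch of the greedy variant: a small draft model proposes a few tokens, and the larger target model verifies them all in a single forward pass. The model pair (gpt2 / gpt2-medium), the propose_k parameter, and the prompt are illustrative assumptions, not Cerebras internals.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical draft/target pair for illustration; any small/large
    # pair sharing a tokenizer would do.
    tok = AutoTokenizer.from_pretrained("gpt2")
    draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    target = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()

    @torch.no_grad()
    def speculative_step(ids: torch.Tensor, propose_k: int = 4) -> torch.Tensor:
        """One greedy speculative-decoding step; assumes batch size 1."""
        n = ids.shape[1]

        # 1. Cheap draft model proposes propose_k tokens autoregressively.
        proposal = ids
        for _ in range(propose_k):
            nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
            proposal = torch.cat([proposal, nxt], dim=-1)

        # 2. Expensive target model scores the whole proposal in ONE pass,
        #    giving its greedy choice at each proposed position plus a
        #    bonus prediction after the final proposed token.
        target_pred = target(proposal).logits[:, n - 1:].argmax(-1)  # [1, k+1]

        # 3. Accept the longest prefix where draft and target agree, then
        #    append the target's own token at the first disagreement (or
        #    the bonus token if all proposals were accepted). The output
        #    matches plain greedy decoding with the target model alone.
        agree = (target_pred[:, :-1] == proposal[:, n:]).long().cumprod(-1)
        n_accept = int(agree.sum())
        return torch.cat([proposal[:, : n + n_accept],
                          target_pred[:, n_accept : n_accept + 1]], dim=-1)

    ids = tok("Wafer-scale inference is", return_tensors="pt").input_ids
    for _ in range(8):
        ids = speculative_step(ids)
    print(tok.decode(ids[0]))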

What you'd actually do

  1. Implement and adapt transformer-based models (NLP and/or vision) to run on Cerebras hardware
  2. Assist in optimizing models for inference performance (latency, throughput)
  3. Run experiments, analyze results, and support model improvements
  4. Help bring up and validate models on the Cerebras system
  5. Debug and troubleshoot model or system issues with guidance from senior team members

Skills

Required

  • Python
  • C++
  • PyTorch
  • Transformer architectures
  • Generative AI
  • machine learning systems
  • deep learning concepts
  • neural networks

Nice to have

  • speculative decoding
  • neural network pruning and compression
  • sparse attention and other sparsity-driven techniques
  • quantization
  • post-training techniques
  • inference-focused evaluations
  • large language models
  • computer vision models
  • Hugging Face Transformers
  • performance concepts such as latency and throughput (see the measurement sketch after this list)
  • Linux environments
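
Since the list calls out latency and throughput, the sketch below shows how those two numbers are commonly measured for a PyTorch model. The model, batch size, sequence length, and iteration count are arbitrary illustrative choices; a GPU run would additionally need torch.cuda.synchronize() around the timers.

    import time
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    batch, seq_len, n_iters = 8, 128, 20
    input_ids = torch.randint(0, model.config.vocab_size, (batch, seq_len))

    with torch.no_grad():
        model(input_ids)  # warm-up pass so one-time setup cost doesn't skew timing
        start = time.perf_counter()
        for _ in range(n_iters):
            model(input_ids)
        elapsed = time.perf_counter() - start

    latency_ms = 1000 * elapsed / n_iters               # wall time per forward pass
    tokens_per_s = batch * seq_len * n_iters / elapsed  # tokens processed per second
    print(f"latency: {latency_ms:.1f} ms/batch  throughput: {tokens_per_s:.0f} tok/s")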

What the JD emphasized

  • fastest Generative AI inference solution
  • speculative decoding
  • large-model pruning and compression (see the sketch after this list)
  • sparse attention
  • sparsity-driven techniques
  • low-latency, high-throughput inference at scale
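
As a pointer to what "large-model pruning and compression" involves in practice, below is a minimal magnitude-pruning sketch using PyTorch's built-in torch.nn.utils.prune utilities; the layer shape and the 50% sparsity target are arbitrary illustrative choices, not a Cerebras recipe.

    import torch.nn.utils.prune as prune
    from torch import nn

    layer = nn.Linear(1024, 1024)

    # Zero out the 50% of weights with the smallest absolute magnitude (L1).
    prune.l1_unstructured(layer, name="weight", amount=0.5)

    # Make the pruning permanent: fold the mask into the weight tensor.
    prune.remove(layer, "weight")

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"weight sparsity: {sparsity:.1%}")  # ~50.0%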
