Staff Inference ML Runtime Engineer

Cerebras Systems · Semiconductors · AI Cloud

Staff Inference ML Runtime Engineer at Cerebras Systems, focused on optimizing and scaling the company's wafer-scale AI chip for high-throughput, low-latency generative AI inference. The role involves designing and implementing ML features, APIs, and distributed runtime solutions, and working with state-of-the-art generative AI models and multimodal data.

What you'd actually do

  1. Drive complex machine learning integration projects and provide technical guidance to a team of software engineers.
  2. Design and implement ML features (e.g., structured outputs, biased sampling, predicted outputs) that improve the performance of generative AI models at inference time (a biased-sampling sketch follows this list).
  3. Design and implement high-throughput, low-latency multimodal inference pipelines that handle image, audio, and video inputs and outputs.
  4. Maintain our scalable serving backend, handling high volumes of concurrent requests.
  5. Scale our inference service by implementing detailed observability throughout the entire stack.
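To make the biased-sampling responsibility concrete, here is a minimal sketch of additive logit biasing in PyTorch. This illustrates the general technique, not Cerebras's implementation; the vocabulary size, token ids, and bias values below are hypothetical.

```python
# Minimal biased-sampling sketch (additive logit bias); all ids/values are
# hypothetical and this is not Cerebras's runtime code.
import torch

def apply_logit_bias(logits: torch.Tensor, bias: dict[int, float]) -> torch.Tensor:
    """Add a per-token bias to raw logits before sampling.

    logits: shape (vocab_size,) for a single decoding step.
    bias:   token id -> additive bias (positive favors the token;
            -inf effectively bans it).
    """
    biased = logits.clone()
    for token_id, value in bias.items():
        biased[token_id] += value
    return biased

# Example: favor token 42, ban token 7, then sample with temperature 0.8.
logits = torch.randn(32_000)                      # fake 32k-token vocabulary
biased = apply_logit_bias(logits, {42: 5.0, 7: -float("inf")})
probs = torch.softmax(biased / 0.8, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```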

Skills

Required

  • Python
  • C++
  • multi-threaded programming
  • performance optimization
  • system-level development
  • large-scale inference systems for LLMs or multimodal models
  • LLM serving frameworks (vLLM, SGLang, TensorRT-LLM); a minimal vLLM example follows this list
  • software architectural patterns for large-scale, high-performance applications
  • ML frameworks (PyTorch)
  • problem-solving skills
  • communication and presentation skills
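As a reference point for the serving-framework requirement, here is a minimal offline-inference sketch using vLLM's public Python API. The model name is a placeholder, and the example says nothing about how Cerebras itself serves models.

```python
# Minimal vLLM offline-inference sketch; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # any HF-compatible model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Explain wafer-scale inference in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```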

Nice to have

  • Experience with distributed runtimes and cluster orchestrators
  • Collaboration with compiler developers, ML scientists, cloud architects, and product teams
  • Inference-time features such as structured outputs, biased sampling, and predicted outputs
  • Multimodal inputs and outputs: image, audio, and video
  • Observability across latency, throughput, memory usage, and compute efficiency (a metrics sketch follows this list)
  • Generative LLM inference and state-of-the-art deep learning techniques
  • Managing technical debt, maintaining automated test suites, and following agile development practices
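For the observability items above, here is a minimal sketch of request-level metrics using prometheus_client. The metrics stack is an assumption (the posting names no tooling), and the metric names, labels, and port are hypothetical.

```python
# Minimal observability sketch with prometheus_client -- an assumed stack;
# metric names, labels, and the port are hypothetical.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests", ["model"])
LATENCY = Histogram("inference_latency_seconds", "End-to-end request latency", ["model"])

def timed_inference(model: str, run):
    """Wrap one inference call with request-count and latency metrics."""
    REQUESTS.labels(model=model).inc()
    start = time.perf_counter()
    try:
        return run()
    finally:
        LATENCY.labels(model=model).observe(time.perf_counter() - start)

start_http_server(9100)   # expose /metrics for scraping
```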

What the JD emphasized

  • high-throughput
  • low-latency
  • generative AI models
  • inference
  • multimodal inference models
  • scalable serving backend
  • large-scale software engineering
  • deep learning
  • LLMs
  • multimodal models
  • LLM serving frameworks
  • large-scale, high-performance applications
  • ML frameworks
  • PyTorch

Other signals

  • inference
  • serving
  • performance optimization
  • distributed systems
  • large-scale ML