LLM Inference Performance & Evals Engineer

Cerebras · Semiconductors · Toronto, ON · Software

Cerebras is seeking an LLM Inference Performance & Evals Engineer to optimize and validate state-of-the-art models on its wafer-scale AI hardware. The role involves prototyping architectural tweaks, building performance-evaluation pipelines, and collaborating with hardware and software teams to accelerate new model ideas and improve inference speed.

What you'd actually do

  1. Prototype and benchmark cutting-edge ideas: new attention variants, mixture-of-experts (MoE), speculative decoding, and other innovations as they emerge.
  2. Develop agent-driven automation that designs experiments, schedules runs, triages regressions, and drafts pull requests.
  3. Work closely with compiler, runtime, and silicon teams: a unique opportunity to experience the full stack of software/hardware innovation.
  4. Keep pace with the latest open- and closed-source models; run them first on wafer scale to expose new optimization opportunities.
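As a flavor of the prototyping-and-benchmarking work in item 1: speculative decoding (mentioned above) has a well-known analytical throughput model, the expected number of tokens accepted per target-model verification step given a draft length and a per-token acceptance rate. The sketch below is purely illustrative (the function and parameter names are assumptions, not Cerebras code):

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens the target model emits per verification step in
    speculative decoding, given per-token draft acceptance rate `alpha`
    and draft length `gamma` (standard capped-geometric-series result)."""
    if alpha >= 1.0:
        # Perfect acceptance: all gamma draft tokens plus one bonus token.
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Example: an 80%-acceptance draft proposing 4 tokens per step yields
# roughly 3.36 tokens per target-model forward pass.
print(round(expected_tokens_per_step(0.8, 4), 2))
```

Plugging in measured acceptance rates from a benchmark run turns this into a quick sanity check on whether an observed speedup is plausible.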

Skills

Required

  • 3+ years building high-performance ML or systems software.
  • Solid grounding in Transformer math (attention scaling, KV-cache, quantization), or clear evidence you can learn this material rapidly.
  • Comfort navigating the full AI toolchain: Python modeling code, compiler IRs, performance profiling, etc.
  • Strong debugging skills across performance, numerical accuracy, and runtime integration.
  • Prior experience in modeling, compilers, or crafting benchmarks and performance studies, not just black-box QA tests.
  • Strong passion for leveraging AI agents or workflow-orchestration tools to boost personal productivity.
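To make the "Transformer math" requirement concrete, a back-of-the-envelope KV-cache size estimate is the kind of calculation this role leans on daily. The sketch below uses an illustrative Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128); the function name is an assumption for illustration:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, dtype_bytes: int = 2) -> int:
    """Bytes needed to cache attention keys and values: a factor of 2 for
    K and V, per layer, per KV head, per head_dim element, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-2-7B-style config, one 4096-token sequence in fp16:
# 2 * 32 * 32 * 128 * 4096 * 2 bytes = exactly 2 GiB of KV cache.
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)
```

Estimates like this explain why grouped-query attention (fewer KV heads) and quantized caches matter so much for inference throughput.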

Nice to have

  • Hands-on with flash-attention, Triton kernels, linear-attention, or sparsity research.
  • Performance-tuning experience on custom silicon, GPUs, or FPGAs.
  • Proficiency in C/C++ programming and experience with low-level optimization.
  • Proven experience in compiler development, particularly with LLVM and/or MLIR.
  • Publications, repos, or blog posts dissecting model speed-ups.
  • Contributions to open-source agent frameworks.

What the JD emphasized

  • high-performance ML or systems software
  • Transformer math
  • performance profiling
  • crafting benchmarks or performance studies

Other signals

  • LLM Inference Performance
  • Wafer-Scale Hardware
  • Benchmarking
  • Evals