Sr. Engineer, Kernel Development and Optimization

Tenstorrent · Semiconductors · Belgrade, Serbia · OPs

The Sr. Engineer, Kernel Development and Optimization at Tenstorrent designs, implements, and optimizes performance-critical kernels for AI hardware, including matrix multiplication and attention primitives. The role involves host-side orchestration, parallelization, developing benchmarks and tests, and collaborating with compiler, runtime, ML, and hardware teams to integrate kernels into production systems. Experience with C++, low-level software, concurrency, and data-driven optimization is required.

What you'd actually do

  1. Design, implement, and optimize GPU-style kernels such as matrix multiplication, attention primitives, and data-movement operations.
  2. Own performance end to end, from identifying bottlenecks to delivering measurable throughput improvements.
  3. Contribute to host-side orchestration code and parallelization strategies.
  4. Develop micro-benchmarks, regression tests, and tooling to ensure correctness and sustained performance gains.
  5. Collaborate closely with compiler, runtime, ML, and hardware teams to integrate kernels into production systems.

Skills

Required

  • C++ systems engineering
  • performance-critical software development
  • low-level software development
  • concurrency
  • synchronization
  • latency hiding
  • compute vs memory trade-offs
  • profiling
  • benchmarking
  • debugging complex runtime or kernel-level issues
  • structured thinking
  • problem decomposition
  • designing GPU-style kernels
  • implementing GPU-style kernels
  • optimizing GPU-style kernels
  • matrix multiplication
  • attention primitives
  • data-movement operations
  • performance ownership
  • bottleneck identification
  • throughput improvement delivery
  • host-side orchestration code development
  • parallelization strategy development
  • micro-benchmark development
  • regression test development
  • tooling development for correctness and performance
  • collaboration with compiler teams
  • collaboration with runtime teams
  • collaboration with ML teams
  • collaboration with hardware teams
  • kernel integration into production systems

Nice to have

  • experience writing performance-critical or low-level software
  • data-driven approach
  • using profiling and benchmarking results to guide optimization decisions
  • debugging complex runtime or kernel-level issues effectively in large codebases
  • structured thinking: breaking down ambiguous performance problems into measurable experiments
  • AI-assisted and agentic workflows for kernel generation, debugging, and optimization
  • writing and optimizing accelerator kernels outside traditional CUDA-first ecosystems
  • translating performance intuition into rigorous, reproducible engineering results
  • understanding how low-level kernels, compilers, runtime systems, and hardware co-evolve in modern AI platforms

What the JD emphasized

  • performance-critical kernels
  • ML workloads
  • GPU-style kernels
  • matrix multiplication
  • attention primitives
  • data-movement operations
  • throughput improvements
  • host-side orchestration
  • parallelization strategies
  • micro-benchmarks
  • regression tests
  • compiler integration
  • runtime integration
  • hardware integration
  • AI hardware
  • accelerator kernels
  • CUDA-first ecosystems
  • AI-assisted workflows
  • agentic workflows
  • kernel generation
  • kernel debugging
  • kernel optimization
  • low-level kernels
  • compilers
  • runtime systems
  • modern AI platforms
