Sr. Engineer, Kernel Development and Optimization

Tenstorrent · Semiconductors · Belgrade, Serbia · OPs

This Sr. Engineer, Kernel Development and Optimization role at Tenstorrent focuses on designing, implementing, and optimizing performance-critical kernels for AI hardware, including matrix multiplication and attention primitives. The role involves host-side orchestration, parallelization, developing benchmarks and tests, and collaborating with compiler, runtime, ML, and hardware teams to integrate kernels into production systems. Experience with C++, low-level software, concurrency, and data-driven optimization is required.

What you'd actually do

Design, implement, and optimize GPU-style kernels such as matrix multiplication, attention primitives, and data-movement operations.
Clear ownership of performance, from identifying bottlenecks to delivering measurable throughput improvements.
Contribution to host-side orchestration code and parallelization strategies.
Development of micro-benchmarks, regression tests, and tooling to ensure correctness and sustained performance gains.
Close collaboration with compiler, runtime, ML, and hardware teams to integrate kernels into production systems.

Skills

Required

C++ systems engineering
performance-critical software development
low-level software development
concurrency
synchronization
latency hiding
compute vs memory trade-offs
profiling
benchmarking
debugging complex runtime or kernel-level issues
structured thinking
problem decomposition
designing GPU-style kernels
implementing GPU-style kernels
optimizing GPU-style kernels
matrix multiplication
attention primitives
data-movement operations
performance ownership
bottleneck identification
throughput improvement delivery
host-side orchestration code development
parallelization strategy development
micro-benchmark development
regression test development
tooling development for correctness and performance
collaboration with compiler teams
collaboration with runtime teams
collaboration with ML teams
collaboration with hardware teams
kernel integration into production systems

Nice to have

experience writing performance-critical or low-level software
data-driven approach
using profiling and benchmarking results to guide optimization decisions
effective at debugging complex runtime or kernel-level issues in large codebases
structured thinker who can break down ambiguous performance problems into measurable experiments
AI-assisted and agentic workflows for kernel generation, debugging, and optimization
writing and optimizing accelerator kernels outside traditional CUDA-first ecosystems
translating performance intuition into rigorous, reproducible engineering results
understanding how low-level kernels, compilers, runtime systems, and hardware co-evolve in modern AI platforms

What the JD emphasized

performance-critical kernels
ML workloads
GPU-style kernels
matrix multiplication
attention primitives
data-movement operations
throughput improvements
host-side orchestration
parallelization strategies
micro-benchmarks
regression tests
compiler integration
runtime integration
hardware integration
AI hardware
accelerator kernels
CUDA-first ecosystems
AI-assisted workflows
agentic workflows
kernel generation
kernel debugging
kernel optimization
low-level kernels
compilers
runtime systems
modern AI platforms

Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists has developed a high-performance RISC-V CPU from scratch and shares a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.

Tenstorrent is building next-generation AI compute. The Kernel Development and Optimization team develops the performance-critical kernels that unlock the full capability of our hardware across ML and HPC workloads.

This role is hybrid, based out of Belgrade, Serbia.

We welcome candidates at various experience levels for this role. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.

Who You Are

A strong C++ systems engineer with experience writing performance-critical or low-level software.
Comfortable reasoning about concurrency, synchronization, latency hiding, and compute versus memory trade-offs.
Data-driven in your approach, using profiling and benchmarking results to guide optimization decisions.
Effective at debugging complex runtime or kernel-level issues in large codebases.
Structured thinker who can break down ambiguous performance problems into measurable experiments.

What We Need

Engineers who can design, implement, and optimize GPU-style kernels such as matrix multiplication, attention primitives, and data-movement operations.
Clear ownership of performance, from identifying bottlenecks to delivering measurable throughput improvements.
Contribution to host-side orchestration code and parallelization strategies.
Development of micro-benchmarks, regression tests, and tooling to ensure correctness and sustained performance gains.
Close collaboration with compiler, runtime, ML, and hardware teams to integrate kernels into production systems.

What You Will Learn

The execution model, memory architecture, and performance characteristics of Tenstorrent AI hardware.
How to write and optimize accelerator kernels outside traditional CUDA-first ecosystems.
Practical AI-assisted and agentic workflows for kernel generation, debugging, and optimization.
How to translate performance intuition into rigorous, reproducible engineering results.
How low-level kernels, compilers, runtime systems, and hardware co-evolve in modern AI platforms.

Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer.

This offer of employment is contingent upon the applicant being eligible to access U.S. export-controlled technology. Due to U.S. export laws, including those codified in the U.S. Export Administration Regulations (EAR), the Company is required to ensure compliance with these laws when transferring technology to nationals of certain countries (such as EAR Country Groups D:1, E1, and E2). These requirements apply to persons located in the U.S. and all countries outside the U.S. As the position offered will have direct and/or indirect access to information, systems, or technologies subject to these laws, the offer may be contingent upon your citizenship/permanent residency status or ability to obtain prior license approval from the U.S. Commerce Department or applicable federal agency. If employment is not possible due to U.S. export laws, any offer of employment will be rescinded.
