ABOUT BASETEN

Baseten powers mission-critical inference for the world's most dynamic AI companies, such as Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. We're growing quickly and recently raised our $300M Series E, backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction. Join us and help build the platform engineers turn to when shipping AI products.

THE ROLE

Container runtimes were designed for general-purpose software workloads. AI inference is not a general-purpose workload.

Running large models at production scale exposes cracks in every layer of the container stack: runtimes unaware of GPU memory constraints, images that take minutes to pull when a model needs to scale to thousands of replicas, and isolation mechanisms that weren't designed for the multi-tenant serving environments that production AI requires. The tools the industry has relied on for a decade weren't built for this, and patching around those limitations at higher layers only goes so far.

Baseten owns the entire pipeline, from the moment a developer pushes a model to the moment a request gets a response. That vertical ownership means we can fix these problems at the root. The Runtime Fabrics team is doing exactly that: purpose-building the container runtime and storage layers for AI inference workloads, led by some of the world's top containerd maintainers.

As Engineering Manager of the Runtime Fabrics team, you will lead this work, setting technical direction, growing a world-class team of systems engineers, and ensuring the team's output shapes not just Baseten's infrastructure but the open-source container ecosystem at large. If you've contributed to containerd, runc, or related OCI projects and are ready to lead a team solving some of the hardest problems in infrastructure today, we'd love to talk.

RESPONSIBILITIES

Team Leadership & Culture

Recruit, hire, and develop a high-performing team of systems engineers with deep container and Linux expertise.
Foster a culture of technical rigor, open-source contribution, and continuous improvement.
Provide regular coaching, feedback, and career development support to your direct reports.
Partner with engineering leadership to define the long-term vision and roadmap for container runtime and storage infrastructure.

Technical Direction

Guide the team in extending and hardening containerd, runc, and related OCI ecosystem projects to meet the GPU-specific requirements of production AI inference, including startup performance, GPU device access, and multi-tenant isolation.
Oversee the architecture and evolution of the Baseten Delivery Network (BDN): the tiered caching and weight delivery system that makes cold starts 2–3x faster and eliminates thundering-herd failures during burst scaling events.
Drive the expansion of BDN beyond model weights, its current focus, to container images, training checkpoints, and deployment artifacts.
Provide technical oversight on GPU-aware isolation mechanisms for multi-tenant inference, including secure container runtimes, Linux namespace hardening, and longer-term micro-VM integration.
Ensure the team maintains end-to-end ownership of the container startup performance path, from snapshotter initialization through weight delivery to first inference request.
Champion the team's contributions back to the open-source containerd ecosystem alongside a team of core maintainers.
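The request coalescing behind "eliminates thundering-herd failures" can be illustrated with a small sketch. This is not BDN's design, which this posting does not describe. It is a hypothetical two-tier cache with an in-process dict in front of a slow origin. It assumes that avoiding a thundering herd means concurrent misses for the same key collapse into a single origin fetch. The `TieredCache` class and the `fetch_from_origin` callback are illustrative names.

```python
import threading
import time
from typing import Callable, Dict


class TieredCache:
    """Hypothetical two-tier cache: an in-process dict in front of a slow
    origin. Concurrent misses for the same key are coalesced so that only
    one caller (the "leader") fetches from the origin; the rest wait."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._local: Dict[str, bytes] = {}               # tier 1: local memory
        self._inflight: Dict[str, threading.Event] = {}  # keys currently being fetched
        self._lock = threading.Lock()

    def get(self, key: str) -> bytes:
        with self._lock:
            if key in self._local:
                return self._local[key]                  # tier-1 hit
            event = self._inflight.get(key)
            is_leader = event is None
            if is_leader:
                event = threading.Event()
                self._inflight[key] = event

        if not is_leader:
            event.wait()                                 # follower: wait for the leader's fetch
            with self._lock:
                if key in self._local:
                    return self._local[key]
            raise RuntimeError(f"origin fetch for {key!r} failed")

        try:
            data = self._fetch(key)                      # only the leader reaches the origin
            with self._lock:
                self._local[key] = data
            return data
        finally:
            with self._lock:
                del self._inflight[key]
            event.set()                                  # wake all waiting followers


# Usage: 50 replicas cold-start at once and all request the same shard.
origin_calls = []


def slow_origin(key: str) -> bytes:
    origin_calls.append(key)
    time.sleep(0.05)                                     # simulate a slow remote read
    return b"weights:" + key.encode()


cache = TieredCache(slow_origin)
results = []
threads = [threading.Thread(target=lambda: results.append(cache.get("model/shard-0")))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(origin_calls), len(results))                   # prints: 1 50
```

The sketch shows only the coalescing idea. A real delivery network would also need eviction, one or more peer tiers between local memory and the origin, and richer error propagation to waiting followers.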

Cross-Functional Partnership

Act as the primary advocate for Runtime Fabrics across the organization, ensuring upstream and downstream teams have the integration support they need.
Collaborate with product and engineering stakeholders to prioritize investments based on business impact and infrastructure reliability.
Communicate team progress, technical trade-offs, and architectural decisions clearly to leadership.

REQUIREMENTS

Proven experience managing and growing engineering teams in a systems, infrastructure, or low-level runtime context.
Deep familiarity with the Linux container ecosystem: containerd, runc, OCI Runtime Spec, Linux namespaces, and cgroups, with the ability to engage credibly in code reviews and architectural discussions.
Contributions to containerd/containerd, opencontainers/runc, google/gvisor, kata-containers/kata-containers, or closely related open-source projects.
Strong systems programming background in Go and/or C/C++.
Experience with distributed storage systems, content-addressable storage, or large-scale caching infrastructure.
Understanding of how container images are structured, stored, and delivered at scale.
Strong written and verbal communication skills, with the ability to influence without authority across teams.
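The content-addressable storage and image-structure requirements rest on one idea from the OCI Image Specification. Every blob (layer, config, or manifest) is named by a digest of its bytes, written as `algorithm:hex`. The sketch below illustrates that idea with a hypothetical in-memory store. The `ContentStore` class and `push_image` helper are illustrative. The layer contents are placeholder bytes, and the manifest is simplified: a real OCI manifest also carries a `config` descriptor.

```python
import hashlib
import json
from typing import Dict

# Digest algorithms registered by the OCI Image Specification.
ALGORITHMS = {"sha256": hashlib.sha256, "sha512": hashlib.sha512}


def digest_of(blob: bytes, algorithm: str = "sha256") -> str:
    """Return an OCI-style digest string such as 'sha256:<hex>'."""
    return f"{algorithm}:{ALGORITHMS[algorithm](blob).hexdigest()}"


def verify(blob: bytes, expected: str) -> bool:
    """Check bytes fetched from an untrusted tier against the digest they were requested by."""
    algorithm, sep, encoded = expected.partition(":")
    if not sep or algorithm not in ALGORITHMS:
        raise ValueError(f"unsupported digest: {expected!r}")
    return ALGORITHMS[algorithm](blob).hexdigest() == encoded


class ContentStore:
    """Hypothetical in-memory content-addressable store keyed by digest."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, blob: bytes) -> str:
        digest = digest_of(blob)
        self.blobs.setdefault(digest, blob)   # identical bytes are stored only once
        return digest

    def get(self, digest: str) -> bytes:
        blob = self.blobs[digest]
        if not verify(blob, digest):
            raise ValueError(f"digest mismatch for {digest}")
        return blob


LAYER_MEDIA_TYPE = "application/vnd.oci.image.layer.v1.tar+gzip"
MANIFEST_MEDIA_TYPE = "application/vnd.oci.image.manifest.v1+json"

store = ContentStore()
base_layer = b"shared base layer: libc, CUDA runtime"   # placeholder bytes


def push_image(app_layer: bytes) -> str:
    """Store layers, then a simplified manifest that references them by digest."""
    layers = [
        {"mediaType": LAYER_MEDIA_TYPE, "digest": store.put(blob), "size": len(blob)}
        for blob in (base_layer, app_layer)
    ]
    manifest = {"schemaVersion": 2, "mediaType": MANIFEST_MEDIA_TYPE, "layers": layers}
    return store.put(json.dumps(manifest, sort_keys=True).encode())


m1 = push_image(b"model server A")
m2 = push_image(b"model server B")
print(len(store.blobs))  # 5: one shared base layer, two app layers, two manifests
```

Because names are derived from content, two images that share a base layer store it once. Any cache tier can also be verified on read without trusting the peer it came from.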

NICE TO HAVE

Experience with GPU device access in containers: NVIDIA Container Toolkit, CDI (Container Device Interface), or GPU-aware scheduling.
Familiarity with lazy-loading snapshotters (stargz, soci, EROFS/Nydus) or peer-to-peer image distribution.
Experience with secure container runtimes (gVisor, Sysbox) or micro-VM technologies (Firecracker, Cloud Hypervisor).
Understanding of containerd's shim API (v2) and experience building custom shim implementations.
Background in multi-tenant infrastructure or security-sensitive serving environments.
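GPU device access and runtime isolation both show up in the `linux` section of an OCI runtime spec `config.json`. That section holds namespaces, user-ID mappings, device nodes, and a cgroup device allow-list. The sketch below builds such a fragment for a tenant that should see only selected GPUs. It illustrates the field layout under stated assumptions and is not a hardened config:

- The helper name and ID ranges are hypothetical.
- Major number 195 is the conventional one for `/dev/nvidia*` nodes.
- A real setup, for example one produced by the NVIDIA Container Toolkit or a CDI spec, also needs nodes such as `/dev/nvidia-uvm`, whose major number is assigned dynamically. It also needs driver libraries mounted into the container.

```python
import json
from typing import Any, Dict, List

NVIDIA_MAJOR = 195      # conventional major number for /dev/nvidiaN and /dev/nvidiactl
NVIDIACTL_MINOR = 255   # /dev/nvidiactl; /dev/nvidiaN uses minor N


def gpu_linux_section(gpu_indices: List[int],
                      host_id_base: int = 100000,
                      id_range: int = 65536) -> Dict[str, Any]:
    """Hypothetical helper: build the `linux` part of an OCI runtime config
    that isolates a tenant in its own namespaces and exposes only the given GPUs."""
    namespaces = [{"type": t} for t in
                  ("pid", "network", "ipc", "uts", "mount", "cgroup", "user")]
    # Map container root to an unprivileged host ID range (user namespace).
    id_map = {"containerID": 0, "hostID": host_id_base, "size": id_range}
    nodes = [("/dev/nvidiactl", NVIDIACTL_MINOR)] + \
            [(f"/dev/nvidia{i}", i) for i in gpu_indices]
    devices = [{"path": path, "type": "c", "major": NVIDIA_MAJOR, "minor": minor,
                "fileMode": 0o666, "uid": 0, "gid": 0} for path, minor in nodes]
    # cgroup device rules: deny everything, then allow only the listed nodes.
    device_rules = [{"allow": False, "access": "rwm"}] + [
        {"allow": True, "type": "c", "major": NVIDIA_MAJOR, "minor": minor, "access": "rw"}
        for _, minor in nodes]
    return {
        "namespaces": namespaces,
        "uidMappings": [dict(id_map)],
        "gidMappings": [dict(id_map)],
        "devices": devices,
        "resources": {"devices": device_rules},
    }


# Usage: a tenant pinned to GPU 1 sees /dev/nvidiactl and /dev/nvidia1 only.
linux = gpu_linux_section([1])
print(json.dumps([d["path"] for d in linux["devices"]]))  # prints: ["/dev/nvidiactl", "/dev/nvidia1"]
```

Combining a user namespace with device passthrough has a consequence. Creating device nodes with mknod is not permitted inside a non-initial user namespace, so runtimes typically bind-mount the host's device nodes instead.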

BENEFITS

Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and their dependents.
Flexible PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
Paid parental leave.
Fertility and family-building stipend through Carrot.
Company-facilitated 401(k).
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Apply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward-thinking team, we would love to hear from you.

At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.

We are an Equal Opportunity Employer and will consider qualified applicants with criminal histories in a manner consistent with applicable law (for example, the requirements of the San Francisco Fair Chance Ordinance, where applicable).