Skills

Required

Kubernetes
GPU infrastructure
SRE
platform engineering
infrastructure roles
Kubernetes internals
scheduler
kubelet
CRDs
operators
admission controllers
GPU/accelerator training workloads
Multi-cluster management
federation
workload placement strategies
Helm
Kustomize
GitOps (Flux/ArgoCD)
AWS
VPC
EKS
EC2
S3
IAM
TGW
Terraform
Pulumi
CI/CD for infrastructure
drift detection
plan gating
rollback strategies
Cost optimization
reserved capacity planning
spot instance management
Prometheus
Grafana
AlertManager
Distributed tracing
OpenTelemetry
Jaeger
Tempo
Log aggregation
Loki
Elasticsearch/OpenSearch
SLO/SLI design
error budget policy
multi-tier alerting
TCP/IP
DNS
TLS
HTTP/2
gRPC
CNI plugins
Cilium
Calico
Flannel
Service mesh (Istio/Linkerd)
ingress controllers
API gateways
Network debugging
packet captures
eBPF traces
kernel counters
Go
Python
Rust
Distributed systems design
consistency
availability
failure modes
Kubernetes operator authoring
controller-runtime patterns
Technical writing
design docs
ADRs
runbooks
Leadership
Cross-Geo Collaboration
async-first collaboration
distributed, cross-timezone teams

Nice to have

Azure
GCP
ML training pipeline internals
eBPF-based observability
eBPF-based networking
Chaos engineering
game days
Open-source infrastructure contributions
Security
compliance
audit experience

What the JD emphasized

10+ years of experience

deep Kubernetes expertise

strong networking fundamentals

operated systems at massive scale

thousands of GPU hours depend on every day

massive, distributed training jobs running on GPU clusters spanning thousands of accelerators

Kubernetes & GPU Infrastructure

Expert-level Kubernetes internals

Proven experience running GPU/accelerator training workloads at scale

Deep AWS hands-on experience required

Prometheus, Grafana, AlertManager — at scale, not just lab setups

Deep TCP/IP, DNS, TLS, HTTP/2, gRPC — not just surface familiarity

Production-quality code in Go, Python, or Rust — you ship, not just script

Led multi-quarter, cross-functional projects from whiteboard to production

Other signals

powers our most demanding ML training workloads

architecting systems

leading multi-quarter projects

reliability bar for an infrastructure that thousands of GPU hours depend on

cutting-edge platform designed to train and serve large-scale machine learning models

supports everything from small-scale experimentation to massive, distributed training jobs running on GPU clusters spanning thousands of accelerators

provides ML engineers and researchers with the tools to onboard, monitor, and scale their workloads

Dynamic GPU orchestration

Training & inference workflows

Observability & cost tracking

Self-service developer tooling

Multi-cloud infrastructure

reliability, scalability, and efficiency of this platform

speed at which AI teams can innovate

Role Overview

We are looking for a Senior Infrastructure Developer with 10+ years of experience to own, evolve, and scale the platform that powers our most demanding ML training workloads. This is not a "keep the lights on" role: you will be architecting systems, writing production-grade code, leading multi-quarter projects across geo-distributed teams, and setting the reliability bar for an infrastructure that thousands of GPU hours depend on every day. You bring deep Kubernetes expertise, strong networking fundamentals, a developer's mindset, and the leadership instincts to navigate ambiguity and drive alignment across cross-functional stakeholders. You have operated systems at massive scale and felt the weight of that responsibility.

About the Platform

You will be working on a cutting-edge platform designed to train and serve large-scale machine learning models. The platform supports everything from small-scale experimentation to massive, distributed training jobs running on GPU clusters spanning thousands of accelerators. It provides ML engineers and researchers with the tools to onboard, monitor, and scale their workloads, whether a lightweight prototype or a production-grade deep learning model powering real-world applications.

Key platform capabilities:

**Dynamic GPU orchestration** using Kubernetes with custom schedulers and resource topology awareness.
**Training & inference workflows:** end-to-end pipeline support from data ingestion through model serving.
**Observability & cost tracking:** full-stack visibility across compute, network, and storage layers.
**Self-service developer tooling** enabling high-velocity experimentation without platform bottlenecks.
**Multi-cloud infrastructure:** primarily AWS with Azure/GCP expansion underway.

Your contributions will directly determine the reliability, scalability, and efficiency of this platform, and the speed at which AI teams can innovate.

What You’ll Do

**Architect for scale:** Design and evolve Kubernetes-native infrastructure capable of running distributed GPU training jobs at massive scale, with an obsession for reliability and efficiency.
**Lead cross-geo initiatives:** Own complex, multi-team projects end-to-end: write design docs, align stakeholders across time zones, and drive delivery in ambiguous, fast-moving environments.
**Codify infrastructure:** Define and ship cloud infrastructure through IaC (Terraform/Pulumi). Treat infra changes with the same rigor, testing, and review as application code.
**Build observability:** Design and maintain deep observability stacks (metrics, distributed tracing, log aggregation, SLO/SLI frameworks) that surface problems before they become incidents.
**Write production code:** Build automation, internal tooling, operators, and platform services in Go, Python, or Rust. This is not a YAML-only role.
**Own reliability:** Lead incident response, post-mortems, and reliability reviews. Drive systemic fixes, not just workarounds. Set the on-call culture.
**Solve hard networking problems:** Debug and resolve complex cluster networking issues: CNI, BGP, service mesh, DNS at scale, east-west traffic, high-throughput tuning.
**Mentor and grow the team:** Raise the technical bar through code reviews, architectural guidance, and knowledge sharing with engineers across experience levels.

What You Bring

Core Requirements:

Kubernetes & GPU Infrastructure

10+ years in SRE, platform engineering, or infrastructure roles
Expert-level Kubernetes internals: scheduler, kubelet, CRDs, operators, admission controllers
Proven experience running GPU/accelerator training workloads at scale
Multi-cluster management, federation, and workload placement strategies
Helm, Kustomize, GitOps (Flux/ArgoCD) — and knowing when not to use them.

Cloud & Infrastructure as Code

Deep AWS hands-on experience required (VPC, EKS, EC2, S3, IAM, TGW)
Terraform or Pulumi — production-grade, modular, tested
CI/CD for infrastructure: drift detection, plan gating, rollback strategies
Cost optimization, reserved capacity planning, and spot instance management at scale

Observability

Prometheus, Grafana, AlertManager — at scale, not just lab setups
Distributed tracing: OpenTelemetry, Jaeger, Tempo
Log aggregation: Loki, Elasticsearch/OpenSearch
SLO/SLI design, error budget policy, and multi-tier alerting

Networking Fundamentals

Deep TCP/IP, DNS, TLS, HTTP/2, gRPC — not just surface familiarity
CNI plugins: Cilium, Calico, Flannel — trade-offs and production behavior
Service mesh (Istio/Linkerd), ingress controllers, and API gateways
Network debugging under load: packet captures, eBPF traces, kernel counters

Coding & System Design

Production-quality code in Go, Python, or Rust — you ship, not just script
Distributed systems design: consistency, availability, failure modes
Kubernetes operator authoring and controller-runtime patterns
Strong code review culture — you raise the bar, not just the PR count
Technical writing: design docs, ADRs, runbooks that others actually read

Leadership & Cross-Geo Collaboration

Led multi-quarter, cross-functional projects from whiteboard to production
Thrives in ambiguity — creates structure and momentum without a perfect spec
Experienced in async-first collaboration across distributed, cross-timezone teams
Strong communicator: can translate infra complexity to product and leadership audiences
Self-driven — you identify the problem, propose the solution, and own the outcome

Bonus Points:

Azure / GCP hands-on depth
ML training pipeline internals
eBPF-based observability / networking
Chaos engineering & game days
Open-source infrastructure contributions
Security, compliance & audit experience

Why This Role

You will write software, not just YAML. This is a coding role as much as it is an operations role.
You will work on real AI infrastructure challenges — the kind that research papers get written about, not buzzword slide decks.
You will have impact across developer productivity, platform scalability, and service reliability simultaneously.
You will lead. This is not an IC-only position — you will shape the technical direction of the team and the platform.
You will join a team that values code quality, systems thinking, blameless culture, and genuine ownership.
You will architect systems at a scale most engineers never get to touch: thousands of GPUs, petabytes of data movement, and milliseconds of scheduling latency that matter.

If you have seen what breaks at scale and have built the systems, the culture, and the habits to make sure it never breaks again — we want to talk.

About Adobe

Adobe empowers everyone to create through innovative platforms and tools that unleash creativity, productivity and personalized customer experiences. Adobe’s industry-leading offerings, including Adobe Acrobat Studio, Adobe Express, Adobe Firefly, Creative Cloud, Adobe Experience Platform, Adobe Experience Manager, and GenStudio, enable people and businesses to turn ideas into impact, powered by AI and driven by human ingenuity.

Our 30,000+ employees worldwide are creating the future and raising the bar as we drive the next decade of growth. We’re on a mission to hire the very best and believe in creating a company culture where all employees are empowered to make an impact. At Adobe, we believe that great ideas can come from anywhere in the organization. The next big idea could be yours.

**Let’s Adobe together**

At Adobe, we believe in creating a company culture where all employees are empowered to make an impact. Learn more about Adobe life, including our values and culture, focus on people, purpose and community, Adobe for All, comprehensive benefits programs, the stories we tell, the customers we serve, and how you can help us advance our mission of empowering everyone to create.

Adobe is proud to be an Equal Employment Opportunity employer. We do not discriminate based on gender, race or color, ethnicity or national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, or any other protected characteristic.

Adobe aims to make our Careers website and recruiting process accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, email accommodations@adobe.com or call +1 408-536-3015.

AI Use Guidelines for Interviews: Our interviews are designed to reflect your own skills and thinking. The use of AI or recording tools during live interviews is not permitted unless explicitly invited by the interviewer or approved in advance as part of a reasonable accommodation. If these tools are used inappropriately or in a way that misrepresents your work, your application may not move forward in the process.

At Adobe, we empower employees to innovate with AI — and we look for candidates eager to do the same. As part of the hiring experience, we provide clear guidance on where AI is encouraged during the process and where it’s restricted during live interviews. See how we think about AI in the hiring experience.