What you'd actually do

Define the multi-year roadmap for Stripe’s Batch Compute Infrastructure, leading complex architectural shifts and modernization.

Build, mentor, and aggressively scale a high-performing team of engineers, proactively investing in their career development and fostering a culture of operational excellence.

Maintain unwavering reliability for Tier-0 infrastructure that processes tens of thousands of workloads daily, proactively mitigating risks and managing complex on-call telemetry.

Collaborate deeply with data platform teams, finance, and user groups to define compute efficiency metrics, execute massive-scale cost optimization strategies, and guarantee compliance with global financial regulations.

Provide technical guidance in architecture reviews, evaluating critical cost, performance, and reliability trade-offs in distributed systems design involving Hadoop, Spark, AWS cloud primitives, and modern metastores.

Skills

Required

10+ years of professional software development and engineering experience
3+ years of direct engineering management experience
Building, scaling, and maintaining large-scale distributed data systems or Tier-0 infrastructure using open-source tools (e.g., Hadoop, Spark, Celeborn, Airflow, Kafka)
Driving significant infrastructure efficiency
Managing capacity planning
Making data-driven cost-performance trade-offs
Working effectively in highly cross-functional, global organizations

Nice to have

Managing remote or geographically distributed engineering teams
Managing a massive fleet of Linux servers, on-premise Hadoop clusters, and modern cloud data architectures (e.g., AWS S3, Graviton)
Navigating strategic ambiguity and delivering complex, multi-quarter infrastructure projects from inception to completion
Deep empathy for internal data users and a passion for building robust developer tooling and abstractions

Who We Are

About Stripe

Stripe is a financial infrastructure platform for businesses. Millions of companies, from the world’s largest enterprises to the most ambitious startups, use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.

About the Team

The Batch Compute Infrastructure team at Stripe manages the foundational infrastructure, tooling, and distributed systems behind Stripe's massive-scale batch processing environments, currently encompassing over 5,000 computational nodes. Powered primarily by Hadoop, Spark, and Celeborn, these systems are the backbone for several core asynchronous financial, analytical, and regulatory workflows at Stripe, operating at petabyte scale.

What you’ll do

You will support a team of engineers focused on building the tooling, infrastructure, and systems for operating Spark, Hadoop, and Celeborn at Stripe. In addition to helping define the roadmap for these systems, you will interact with many other managers and their teams at Stripe who rely on the Data processing Infra team to deliver efficient and scalable services to our customers. You will work with both the finance and engineering organizations (infrastructure and product) to define, measure, and monitor the cost efficiency of these systems.

Responsibilities

Drive Strategic Vision: Define the multi-year roadmap for Stripe’s Batch Compute Infrastructure, leading complex architectural shifts and modernization.
Lead and Scale: Build, mentor, and aggressively scale a high-performing team of engineers, proactively investing in their career development and fostering a culture of operational excellence.
Ensure Operational Rigor: Maintain unwavering reliability for Tier-0 infrastructure that processes tens of thousands of workloads daily, proactively mitigating risks and managing complex on-call telemetry.
Cross-Functional Orchestration: Collaborate deeply with data platform teams, finance, and user groups to define compute efficiency metrics, execute massive-scale cost optimization strategies, and guarantee compliance with global financial regulations.
Technical Stewardship: Provide technical guidance in architecture reviews, evaluating critical cost, performance, and reliability trade-offs in distributed systems design involving Hadoop, Spark, AWS cloud primitives, and modern metastores.

Who You Are

Minimum requirements

10+ years of professional software development and engineering experience.
3+ years of direct engineering management experience, successfully building and operating high-velocity technical teams.
Deep technical background in building, scaling, and maintaining large-scale distributed data systems or Tier-0 infrastructure using open-source tools (e.g., Hadoop, Spark, Celeborn, Airflow, Kafka).
Proven track record of driving significant infrastructure efficiency, managing capacity planning, and making data-driven cost-performance trade-offs.
Experience working effectively in highly cross-functional, global organizations.

Preferred requirements

Experience managing remote or geographically distributed engineering teams.
Familiarity with managing a massive fleet of Linux servers, on-premise Hadoop clusters, and modern cloud data architectures (e.g., AWS S3, Graviton).
Demonstrated ability to navigate strategic ambiguity and deliver complex, multi-quarter infrastructural projects from inception to completion.
Deep empathy for internal data users and a passion for building robust developer tooling and abstractions.

Engineering Manager - Batch Compute Infrastructure
