Staff Software Engineer, Stream Compute

Stripe · Fintech · United States · 8127 Core Infrastructure

A Staff Software Engineer role to build and operate Flink-powered stream processing systems at scale for a fintech company. The focus is on reliability, state management, exactly-once processing, and automation. Requires experience with distributed systems, big data technologies such as Flink and Kafka, and treating infrastructure as a product.

What you'd actually do

  1. Design, build, and operate stream compute infrastructure with Apache Flink at the center, alongside technologies like Kafka, Temporal, and AWS services
  2. Partner with product and platform teams across Stripe to understand requirements, unblock Flink adoption, and improve how stream processing infrastructure is used end-to-end
  3. Define and implement operational best practices (e.g., shuffle sharding, cellular architecture, load shedding, automated state recovery) to improve resilience and reliability at scale
  4. Drive fleet-level automation and standardization ("pets" to "cattle") through self-service workflows, safer rollouts, and self-healing systems that reduce manual operations
  5. Lead initiatives that raise the bar on Flink availability and state durability (e.g., multi-region strategies, disaster recovery readiness, operational readiness reviews, incident learning)
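One of the practices named above, shuffle sharding, limits the blast radius of a noisy or failing workload by assigning each customer a small, deterministic subset of the fleet's shards, so that very few customers share the exact same subset. The sketch below is purely illustrative (the hash scheme, shard counts, and function name are assumptions, not Stripe's implementation):

```python
import hashlib


def shard_set(customer_id: str, total_shards: int, shards_per_customer: int) -> frozenset:
    """Deterministically pick a small subset of shards for a customer.

    Hashing the customer id seeds a stable selection, so the same
    customer always lands on the same shard subset.
    """
    digest = hashlib.sha256(customer_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")

    # Stable partial Fisher-Yates shuffle: pick `shards_per_customer`
    # distinct shard indices without replacement.
    shards = list(range(total_shards))
    chosen = []
    for i in range(shards_per_customer):
        # Simple 64-bit linear congruential step to advance the seed.
        seed = (seed * 6364136223846793005 + 1442695040888963407) % (1 << 64)
        j = i + seed % (len(shards) - i)
        shards[i], shards[j] = shards[j], shards[i]
        chosen.append(shards[i])
    return frozenset(chosen)
```

With, say, 8 shards and 2 shards per customer there are 28 distinct shard pairs, so a misbehaving customer's full shard set overlaps completely with only a small fraction of other customers, and the rest retain at least one healthy shard.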

Skills

Required

  • 10+ years of experience building, operating, and evolving large-scale production systems
  • Experience as a technical lead for team(s) working on distributed systems, including scaling them in fast-moving environments
  • Hands-on experience with big data technologies such as Flink, Spark, Kafka, Pulsar, or Pinot
  • Experience developing, maintaining, and debugging distributed systems built with open source tools
  • Experience building and scaling infrastructure as a product
  • Strong software engineering skills
  • Ability to write high-quality code (in programming languages like Go, Java, Scala, etc.)
  • Comfortable operating with high autonomy and ownership
  • Strong written and verbal communication skills, including the ability to produce clear technical documentation

Nice to have

  • Experience operating streaming infrastructure as a platform (e.g., Flink clusters, Kafka, Pulsar) for internal customers at scale
  • Deep hands-on experience authoring, optimizing, and operating real-time processing frameworks such as Flink, Spark Streaming, Storm, or Kafka Streams in production
  • Experience building or operating control planes for managing large-scale infrastructure
  • Open source contributions to data processing or big data systems (Hadoop, Spark, Celeborn, Flink, etc)

What the JD emphasized

  • Flink-powered stream processing systems
  • significant scale
  • various critical financial operations and real-time analytics
  • intersection of real-time data processing and fintech innovation
  • innovation, user experience, reliability, and compliance
  • crucial part of Stripe's success
  • Flink-first stream compute infrastructure
  • extremely high availability targets at global scale
  • operating Flink in production
  • state management
  • exactly-once processing
  • performance isolation
  • automated recovery
  • stateful stream processing applications
  • Apache Flink
  • Kafka
  • Temporal
  • AWS services
  • Flink adoption
  • stream processing infrastructure
  • operational best practices
  • resilience and reliability at scale
  • fleet-level automation and standardization
  • self-service workflows
  • safer rollouts
  • self-healing systems
  • Flink availability and state durability
  • multi-region strategies
  • disaster recovery readiness
  • operational readiness reviews
  • incident learning
  • Flink ecosystem capabilities
  • developer experience
  • scalability
  • reliability
  • open source community