Staff Software Engineer, Machine Learning Platform

Stripe · Fintech · United States · ML Foundations

Stripe's ML Platform team is seeking a Staff Software Engineer to lead the technical direction and architecture of its ML infrastructure. The role focuses on building and operating platforms for ML training, model serving, feature stores, and agentic AI capabilities, with the aim of increasing ML velocity across the company. The engineer will own end-to-end system design, define strategy for next-generation ML infrastructure, and work cross-functionally to deliver robust solutions.

What you'd actually do

  1. Take ownership of end-to-end architecture and system design for large, complex projects across ML Platform.
  2. Define technical direction for highly ambiguous projects, transforming complex user needs into long-lasting platform strategy.
  3. Design system architectures for the most challenging ML Platform problems in one or more areas, including AI and ML workflow orchestration, scalable CPU and GPU compute infrastructure, model training, LLM fine-tuning, low-latency model inference, large-scale feature stores, real-time monitoring, and LLM and agent orchestration.
  4. Turn high-leverage ideas into tangible, robust solutions that shape the platform and product roadmap, combining technical excellence with creative problem-solving.
  5. Scope and lead large projects with significant business impact, driving them from requirements through design, implementation, and production operation.

Skills

Required

  • 10+ years of professional software development experience
  • service-oriented architecture
  • large-scale distributed systems
  • technical lead
  • production ML platform
  • performance
  • reliability
  • scalability
  • cost efficiency
  • product instincts
  • business context
  • communication skills
  • cross-functional collaboration
  • autonomy
  • responsibility
  • ambiguous environments
  • AI tools

Nice to have

  • large-scale ML training
  • serving
  • data infrastructure
  • distributed training
  • model inference
  • feature stores
  • real-time feature computation
  • model registries
  • accelerator-backed compute
  • training data pipelines
  • experiment tracking
  • model evaluation
  • rapidly developing prototypes
  • iterating based on user feedback
  • training and shipping machine learning models to production
  • LLMs
  • LLM application frameworks
  • agentic AI patterns

What the JD emphasized

  • production ML platform
  • low-latency model inference
  • LLM and agent orchestration
  • large-scale distributed systems

Other signals

  • ML Platform
  • ML infrastructure
  • model serving
  • LLM inference
  • agent orchestration