Software Engineer, Machine Learning Infrastructure

Stripe · Fintech · Canada · ML Foundations

Stripe's ML Infra team is seeking a Software Engineer to build and scale services across the ML lifecycle, including training, serving, and LLM applications, to accelerate AI/ML adoption across the company. The role focuses on designing and implementing robust, high-availability infrastructure for production ML platforms.

What you'd actually do

  1. Designing and building scalable, reliable, and secure services for notebooks, ML model training, experimentation, serving, and LLM applications across multiple regions.
  2. Creating services and libraries that enable ML engineers at Stripe to seamlessly transition from experimentation to production across Stripe’s systems.
  3. Working directly with product teams and ML engineers to improve their day-to-day productivity.
  4. Owning and solving technical and product challenges that span a diverse set of systems, processes, and technologies.

Skills

Required

  • 2+ years of professional software development experience
  • Solid background in service-oriented architecture
  • Large-scale distributed systems
  • Full software development life cycle
  • Production ML platforms
  • MLOps solutions
  • Building LLM applications
  • Running operations for high-availability, low-latency systems
  • Partnering with other teams to drive business outcomes

Nice to have

  • Experience building and shipping production AI agents
  • Familiarity with LLMs and LLM frameworks
  • Experience training and shipping machine learning models to production to solve critical business problems

What the JD emphasized

  • production ML platforms
  • high-availability, low-latency systems
  • building LLM applications
  • production AI agents
  • training and shipping machine learning models to production

Other signals

  • The ML Infra team builds services and tools that power every step of the ML lifecycle
  • accelerate the adoption of AI/ML across all parts of the company
  • building highly scalable and reliable foundational infrastructure