Software Engineer, Machine Learning Platform

Chime · Fintech · San Francisco, CA · Data Engineering

Chime's Machine Learning Platform (MLP) team builds and operates the infrastructure, tooling, and developer experience that power machine learning across the company. This role focuses on building robust foundations that allow ML teams to move quickly while maintaining reliability, governance, and cost efficiency. The engineer will design and build scalable systems that support model training, feature computation, real-time inference, and experimentation, working at the intersection of distributed systems, cloud infrastructure, and applied machine learning.

What you'd actually do

  1. Design, build, and operate scalable ML infrastructure on AWS
  2. Develop distributed training and batch processing systems using Ray
  3. Build and maintain infrastructure-as-code using Terraform
  4. Support and evolve the feature store and feature pipelines
  5. Develop data ingestion and streaming systems (e.g., Kinesis, Kafka, Flink, Spark, or similar technologies)

Skills

Required

  • ML infrastructure
  • platform engineering
  • production ML systems
  • machine learning model development lifecycle
  • data preprocessing
  • model training
  • evaluation
  • deployment
  • distributed systems
  • cloud computing
  • large-scale data processing
  • computer science
  • software engineering principles
  • CI/CD pipelines
  • DevOps practices
  • infrastructure as code
  • containerization
  • Docker
  • Kubernetes
  • orchestration systems
  • AWS
  • Spark
  • Ray
  • GPU programming
  • CUDA
  • Python
  • Go
  • Scala
  • Java
  • Terraform
  • CloudFormation
  • testing
  • version control
  • code review
  • observability

Nice to have

  • distributed compute frameworks
  • feature store
  • real-time ML systems
  • model serving
  • streaming technologies
  • Kafka
  • Kinesis
  • Flink
  • Spark Streaming
  • ML lifecycle workflows
  • ML experimentation platforms
  • model governance practices

What the JD emphasized

  • 5+ years of experience in ML infrastructure, platform engineering, or production ML systems
  • Knowledge of the machine learning model development lifecycle, including data preprocessing, model training, evaluation, and deployment
  • Experience with distributed systems, cloud computing, or large-scale data processing
  • Strong foundation in computer science and software engineering principles
  • Deep interest in the impact and evolution of advanced AI technologies
  • Hands-on experience with CI/CD pipelines, DevOps practices, and infrastructure as code
  • Experience with containerization technologies such as Docker and Kubernetes, and orchestration systems
  • Knowledge of cloud platforms such as AWS and distributed computing frameworks such as Spark and Ray
  • Experience with GPU programming (CUDA) and GPU costs and optimization
  • Strong programming skills in Python, Go, Scala, Java, or similar languages
  • Familiarity with infrastructure-as-code (e.g., Terraform, CloudFormation)
  • Solid understanding of software engineering fundamentals (testing, version control, code review, observability)

Other signals

  • ML infrastructure
  • developer experience
  • model training
  • feature computation
  • real-time inference
  • experimentation