Senior Engineering Manager, ML Platform

Whatnot · Consumer · San Francisco, CA · Engineering

Senior Engineering Manager, ML Platform at Whatnot, a livestream shopping platform. This role focuses on leading the development and scaling of core infrastructure for machine learning and self-hosted LLM applications. Responsibilities include building low-latency model serving, streaming feature ingestion, distributed training, and high-throughput GPU inference systems. The role requires strong technical depth, hands-on coding, and experience managing production ML systems at consumer scale.

What you'd actually do

  1. Own the infrastructure powering AI and ML models across critical business surfaces: supporting growth, recommendations, trust and safety, fraud, seller tooling, and more.
  2. Guide the prototyping, deployment, and productionization of novel ML architectures that directly shape user experience and marketplace dynamics.
  3. Help design and scale inference infrastructure capable of serving large models with low latency and high throughput.
  4. Oversee and evolve real-time feature pipelines that feed both our online and offline stores, ensuring single-second feedback from behavioral signals, high reliability, and model training fidelity.
  5. Drive feature platform improvements and expand scope to cover non-ML use cases such as fraud rules where point-in-time backtesting is also critical.
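The point-in-time backtesting mentioned above hinges on one rule: when replaying history, a model (or fraud rule) may only see feature values that were already known at the moment of each event, never values written later. A minimal sketch of a point-in-time feature lookup (names and data are hypothetical, not Whatnot's actual platform):

```python
from bisect import bisect_right

def point_in_time_lookup(feature_history, event_ts):
    """Return the latest feature value whose timestamp is <= event_ts.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    Returns None if no value was known yet at event_ts, which prevents
    future values from leaking into a backtest.
    """
    timestamps = [ts for ts, _ in feature_history]
    i = bisect_right(timestamps, event_ts)
    return feature_history[i - 1][1] if i else None

# A user's fraud-score feature, updated at t=100, 200, 300.
history = [(100, 0.2), (200, 0.5), (300, 0.9)]

point_in_time_lookup(history, 250)  # at t=250 only the t=200 value (0.5) was known
point_in_time_lookup(history, 50)   # no value existed yet -> None
```

Real feature stores implement the same idea as a point-in-time join across many entities and features at once; the binary search here is just the one-entity, one-feature core of that operation.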

Skills

Required

  • 4+ years of engineering management experience developing production machine learning systems at consumer-scale loads
  • 5+ years of hands-on software engineering experience building and maintaining production systems for consumer-scale loads
  • 1+ years of professional experience developing software in Python
  • Ability to work autonomously, drive initiatives across multiple product areas, and communicate findings to leadership and product teams.
  • Experience with operational, search, and key-value databases such as PostgreSQL, DynamoDB, Elasticsearch, and Redis.
  • Experience working with ML-specific tools and frameworks such as MLflow, LitServe, TorchServe, and Triton.
  • Firm grasp of visualization tools for monitoring and logging, e.g. Datadog and Grafana.
  • Familiarity with cloud computing platforms and managed services such as AWS SageMaker, Lambda, Kinesis, S3, EC2, and EKS/ECS, as well as streaming frameworks such as Apache Kafka and Flink.
  • Professionalism in collaborating in a remote working environment, with well-tested, reproducible work.
  • Exceptional documentation and communication skills.

Nice to have

  • Bachelor’s degree in Computer Science, Statistics, Applied Mathematics or a related technical field, or equivalent work experience.

What the JD emphasized

  • production machine learning systems at consumer-scale loads
  • low-latency deep learning model serving
  • streaming feature ingestion
  • distributed training
  • high-throughput GPU inference

Other signals

  • building systems that make advanced ML dependable and fast at scale