Software Engineer, Machine Learning Infrastructure

Whatnot · Consumer · San Francisco, CA · Engineering

Whatnot is hiring a Software Engineer for its Machine Learning Infrastructure team, focused on scaling AI and ML infrastructure for large language models and other ML applications. The role spans owning AI/ML infrastructure, prototyping and productionizing ML architectures, designing and scaling inference infrastructure for low-latency, high-throughput serving, and building distributed training and inference pipelines.

What you'd actually do

  1. Own the infrastructure powering AI and ML models across critical business surfaces: growth, recommendations, trust and safety, fraud, seller tooling, and more.
  2. Prototype, deploy, and productionize novel ML architectures that directly shape user experience and marketplace dynamics.
  3. Design and scale inference infrastructure capable of serving large models with low latency and high throughput.
  4. Build distributed training and inference pipelines leveraging GPUs and both model and data parallelism.

Skills

Required

  • 4+ years of professional experience developing machine learning systems and algorithms
  • Bachelor’s degree in Computer Science, Statistics, Applied Mathematics or a related technical field, or equivalent work experience
  • 3+ years of software engineering experience building and maintaining production systems for consumer-scale loads
  • 1+ years of professional experience developing software in Python
  • Ability to work autonomously, drive initiatives across multiple product areas, and communicate findings to leadership and product teams
  • Experience with operational, search, and key-value databases such as PostgreSQL, DynamoDB, Elasticsearch, Redis
  • Firm grasp of visualization tools for monitoring and logging, e.g., Datadog, Grafana
  • Familiarity with cloud computing platforms and managed services (e.g., AWS SageMaker, Lambda, Kinesis, S3, EC2, EKS/ECS) and streaming frameworks such as Apache Kafka and Flink
  • Professionalism in collaborating in a remote working environment, with well-tested, reproducible work
  • Exceptional documentation and communication skills

What the JD emphasized

  • low-latency
  • high throughput
  • distributed training
  • large model serving

Other signals

  • building systems that make advanced ML dependable and fast at scale