Machine Learning Infrastructure Engineer

Whatnot · Consumer · San Francisco, CA · Engineering

Seeking an ML Infrastructure Engineer to design and scale core infrastructure for ML and LLM applications, focusing on low-latency serving, distributed training, and high-throughput GPU inference to productionize cutting-edge models.

What you'd actually do

  1. Own the infrastructure powering AI and ML models across critical business surfaces: growth, recommendations, trust and safety, fraud, seller tooling, and more.
  2. Prototype, deploy, and productionize novel ML architectures that directly shape user experience and marketplace dynamics.
  3. Design and scale inference infrastructure capable of serving large models with low latency and high throughput.
  4. Build distributed training and inference pipelines leveraging GPUs and both model and data parallelism.
  5. Stretch beyond your comfort zone to take on new technical challenges as we scale AI across Whatnot’s ecosystem.

Skills

Required

  • 4+ years of professional experience developing machine learning systems and algorithms
  • 3+ years of software engineering experience building and maintaining production systems for consumer-scale loads
  • 1+ years of professional experience developing software in Python
  • Ability to work autonomously, drive initiatives across multiple product areas, and communicate findings to leadership and product teams.
  • Experience with operational, search, and key-value databases such as PostgreSQL, DynamoDB, Elasticsearch, and Redis.
  • Firm grasp of visualization tools for monitoring and logging (e.g., Datadog, Grafana).
  • Familiarity with cloud computing platforms and managed services (e.g., AWS SageMaker, Lambda, Kinesis, S3, EC2, EKS/ECS) and with streaming systems such as Apache Kafka and Flink.
  • Professionalism in collaborating within a remote working environment, with well-tested, reproducible work.
  • Exceptional documentation and communication skills.

Nice to have

  • Bachelor’s degree in Computer Science, Statistics, Applied Mathematics or a related technical field, or equivalent work experience.

What the JD emphasized

  • production systems
  • low-latency
  • high throughput
  • distributed training
  • GPU inference

Other signals

  • building systems that make advanced ML dependable and fast at scale
  • low-latency, large model serving