Staff Machine Learning Systems Engineer

Reddit Reddit · Consumer · United States · Remote · Machine Learning

Staff ML Systems Engineer to lead development of a platform for large scale ML models at Reddit, focusing on MLOps, graph ML, performance tuning, and data processing pipelines.

What you'd actually do

  1. Design end-to-end model lifecycle patterns (MLOps) to boost velocity of development for ML engineers, including data preparation, model management, experiment tracking, and more
  2. Zero-to-one development and support of a graph ML codebase and platform that abstracts away common patterns and enables greater model scalability and iteration
  3. Collaborate with ML engineers on performance tuning, including improving model training time, efficiency, and GPU training costs in a large, distributed ML training environment
  4. Optimize batch data processing within a data warehouse and with tools such as Apache Beam, Apache Spark, Ray Data, and more
  5. Architect pipelines to build and maintain massive graph data structures on the order of billions of nodes and tens of billions of edges

Skills

Required

  • ML infrastructure
  • model training
  • model deployments
  • ML optimization
  • memory and GPU profiling
  • cloud-based technologies
  • GCP BigQuery
  • Google Cloud Storage
  • infrastructure-as-code (Terraform)
  • MLOps tools
  • experiment tracking
  • model serving
  • model registries
  • Python
  • PyTorch
  • Tensorflow
  • distributed training frameworks
  • Ray
  • Kubernetes
  • scalability
  • reliability
  • performance
  • ease of use
  • machine learning development lifecycle

Nice to have

  • graph databases (Neo4j, JanusGraph, TigerGraph)
  • graph neural networks (GNNs)
  • graph ML frameworks (PyTorch Geometric, Deep Graph Library)

What the JD emphasized

  • 8+ years of experience in ML infrastructure
  • Deep experience with cloud-based technologies for supporting an ML platform
  • Deep experience working with distributed training frameworks
  • Strong focus on scalability, reliability, performance, and ease of use

Other signals

  • MLOps platform
  • model lifecycle
  • distributed training
  • graph ML
  • ML infrastructure