Senior Machine Learning Systems Engineer, Ads ML Experience Platform

Reddit · Consumer · United States · Remote · Ads Engineering

This role focuses on building the next generation of ML research tools and agentic AI platforms for Reddit's Ads ML lifecycle. The engineer will design and build large-scale ML experimentation platforms, production training orchestration frameworks, and an agentic AI execution platform. Experience with distributed systems, ML infrastructure, and agentic architectures is required.

What you'd actually do

  1. Design and build large-scale offline ML experimentation platforms that enable reproducible research, model development, evaluation, and promotion workflows.
  2. Develop production-grade training orchestration frameworks supporting distributed training, hyperparameter optimization, model evaluation, and automated retraining.
  3. Build infrastructure for experiment tracking, metadata management, lineage, artifact versioning, model registries, and reproducibility.
  4. Partner with ML engineers and researchers to improve experimentation velocity and operational efficiency.
  5. Build automated workflows for model promotion, rollback, compliance validation, and continuous evaluation.

Skills

Required

  • infrastructure/platform engineering
  • large-scale distributed systems
  • production ML infrastructure
  • developer SDKs
  • platform APIs
  • self-service AI tooling
  • workflow orchestration systems
  • developer platforms
  • large-scale automation frameworks
  • distributed data processing systems (Spark, Flink, Ray, or equivalent)
  • modern orchestration and workflow technologies (Kubeflow, Argo, Airflow, or similar)
  • offline ML experimentation platforms
  • model registries
  • experiment tracking systems
  • training orchestration frameworks

Nice to have

  • agentic AI execution platform
  • multi-agent orchestration
  • autonomous and human-in-the-loop workflows
  • memory/context systems
  • scalable workflow infrastructure
  • agent communication/runtime frameworks (e.g., MCP, A2A, and orchestration systems)
  • end-to-end model development and iteration cycles at scale

What the JD emphasized

  • deep expertise in large-scale distributed systems
  • hands-on experience building and operating production ML infrastructure
  • experience building workflow orchestration systems
  • experience building and operating agentic AI systems

Other signals

  • building foundational tooling for next-generation machine learning developer experience (DevX)
  • design and build large-scale offline ML experimentation platforms
  • develop production-grade training orchestration frameworks
  • build automated workflows for model promotion, rollback, compliance validation, and continuous evaluation
  • design and build an agentic AI execution platform supporting autonomous and human-in-the-loop workflows