Senior Machine Learning Engineer

DocuSign · Enterprise · San Francisco, CA +2 · Engineering

The Senior Machine Learning Engineer will build the 'brain' of Docusign's global services infrastructure: a self-healing ecosystem that uses Multi-Agent Systems, Reinforcement Learning, and LLMs to detect, troubleshoot, and resolve incidents autonomously. The role involves designing and implementing autonomous remediation systems, developing GenAI agents for root cause analysis, deploying deep learning models for time series data, optimizing inference pipelines, and owning the model lifecycle from feature engineering to production monitoring.

What you'd actually do

  1. Design and implement autonomous multi-agent systems using Reinforcement Learning (RL) loops that can interact with our infrastructure to perform safe, automated remediation actions
  2. Build GenAI agents capable of digesting logs, traces, and metrics to provide "Human-in-the-loop" root cause analysis and conversational debugging for our SREs
  3. Develop and deploy deep learning models (Transformers, LSTMs, etc.) for forecasting and anomaly detection on high-cardinality, high-volume time series data
  4. Optimize inference pipelines to run with low latency on streaming telemetry data (Kafka/Flink), ensuring we catch issues the moment they happen
  5. Own the lifecycle of your models, from feature engineering on petabyte-scale datasets through training, deployment, and monitoring in production Kubernetes environments

Skills

Required

  • 8+ years of professional experience in Machine Learning Engineering or Data Science
  • PyTorch or TensorFlow
  • Time Series analysis (forecasting/anomaly detection)
  • NLP
  • building applications using LLMs (RAG pipelines, LangChain, vector databases)
  • technical domains (code analysis, log parsing)
  • RL concepts (policies, rewards, agents)
  • optimization or control problems
  • distributed data processing and streaming technologies (Apache Spark, Kafka, Flink)
  • software engineering fundamentals (Python, C++, or Go)
  • CI/CD for ML
  • deploying models via APIs (FastAPI, Triton Inference Server)

Nice to have

  • the "three pillars" (Logs, Metrics, Traces)
  • Prometheus, Grafana, OpenTelemetry, or Jaeger
  • AutoGen, CrewAI, or Ray RLlib
  • AWS/GCP/Azure
  • Kubernetes (K8s) orchestration
  • control theory
  • causal inference

What the JD emphasized

  • moving beyond simple anomaly detection
  • Multi-Agent Systems
  • Reinforcement Learning
  • Large Language Models (LLMs)
  • not only detect incidents in real time but also troubleshoot and resolve them autonomously
  • massive datasets (billions of telemetry points)
  • solve real-world reliability challenges
  • petabyte-scale datasets
  • low latency on streaming telemetry data
  • catch issues the moment they happen

Other signals

  • Multi-Agent Systems
  • Reinforcement Learning
  • LLMs
  • autonomous remediation
  • time series forecasting
  • anomaly detection
  • low latency inference