Senior Machine Learning Engineer, Reliability

Roblox · Consumer · San Mateo, CA · Software Engineering

Senior Machine Learning Engineer focused on improving the reliability of the Roblox platform by leveraging ML for anomaly detection, root cause analysis, and predictive scaling. The role involves building ML systems that consume and reason over large-scale data streams (logs, traces, metrics) to proactively identify and resolve production issues.

What you'd actually do

  1. Help define the roadmap for leveraging Machine Learning Engineering to improve Production Systems Reliability at Roblox.
  2. Improve real-time anomaly detection using state-of-the-art ML techniques, directly reducing Mean Time to Detect for production issues.
  3. Build pipelines that consume various data streams (metrics, logs, traces, change management systems, etc.).
  4. Build a reasoning layer that interacts with the streams of data to find possible root causes of problems happening in production.
  5. Build time-series models that predict capacity exhaustion and seasonal traffic spikes to drive automated scaling.

Skills

Required

  • Machine Learning Engineering
  • Production Systems Reliability
  • Anomaly Detection
  • Time-series modeling
  • Data Pipelines
  • Distributed Systems
  • High Throughput Systems

Nice to have

  • Fine-tuning models
  • Architecting infrastructure
  • Learning from user and/or automated feedback

What the JD emphasized

  • Expert with knowledge of various modeling techniques and the ability to go deep and fine-tune models to fit our use cases
  • Ability to propose and architect the infrastructure that allows us to implement systems that learn from user and/or automated feedback
  • Good distributed systems fundamentals and understanding of large scale high throughput systems

Other signals

  • improve reliability of the overall Roblox platform
  • proactively detect issues before they become real problems
  • reduce time to resolve incidents
  • real-time anomaly detection
  • predict capacity exhaustion and seasonal traffic spikes