Principal AI/ML Engineer, Reliability

Roblox · Consumer · San Mateo, CA · Machine Learning

A Principal AI/ML Engineer role focused on using ML to improve the reliability of the Roblox platform by detecting issues proactively and reducing incident resolution time. The work involves building data pipelines, anomaly detection systems, reasoning layers for root cause analysis, and time-series models for capacity prediction.

What you'd actually do

  1. Define the strategy of leveraging Machine Learning Engineering to improve Production Systems Reliability at Roblox.
  2. Improve real-time anomaly detection by applying state-of-the-art ML techniques, directly reducing Mean Time to Detect for production issues.
  3. Develop methods for building pipelines that consume various data streams (metrics, logs, traces, change management systems, etc.).
  4. Build a reasoning layer that interacts with the streams of data to find possible root causes of problems happening in production.
  5. Build time-series models to predict capacity exhaustion and seasonal traffic spikes to drive automated scaling.

Skills

Required

  • Machine Learning Engineering
  • Production Systems Reliability
  • real-time anomaly detection
  • ML techniques
  • data pipelines
  • time-series models
  • distributed systems fundamentals
  • large-scale, high-throughput systems

Nice to have

  • logs
  • traces
  • metrics
  • production changes

What the JD emphasized

  • set the 3-5 year technical strategy and architectural blueprint
  • own the architectural and execution roadmap
  • expert who has knowledge of various modeling techniques, ability to go deep and fine tune models

Other signals

  • improve reliability of the overall Roblox platform
  • proactively detect issues before they become real problems
  • reduce time to resolve incidents
  • real-time anomaly detection
  • time-series models to predict capacity exhaustion