Staff ML Engineer - ML Infrastructure

Samsara · Enterprise · San Francisco, CA · Remote · Safety AI

Staff ML Engineer focused on building and operating an end-to-end ML platform for industrial AI applications, covering training, experimentation, inference, and edge deployment. The role involves significant ownership of infrastructure and reliability, along with technical leadership.

What you'd actually do

  1. Design, build, and operate Samsara’s end-to-end ML platform (training, experimentation, batch/online inference, edge) used by multiple Safety AI product teams.
  2. Partner with product and applied ML teams to ship ML-powered features (CV models, EcoDriving insights, LLM-based reporting) that improve safety, reliability, and cost efficiency.
  3. Design and operate scalable online and batch inference systems (Ray, Spark), including deployment patterns, observability, SLOs, and unified training-to-production workflows.
  4. Own reliability, observability, and security for ML systems across cloud and edge, including on-call practices, incident response, and infrastructure hardening.
  5. Provide Staff+/Senior-Staff technical leadership on ML infrastructure architecture and strategy, influencing cross-team decisions and mentoring engineers and applied scientists.

Skills

Required

  • 10+ years of experience in machine learning engineering or related fields
  • Strong track record of building and operating large-scale ML systems
  • Strong experience with distributed computing frameworks such as Ray and/or Spark
  • Hands-on experience with cloud infrastructure (AWS), containers/Kubernetes, and production observability tooling
  • Proven experience building or supporting ML platforms (training, experimentation, or inference) used by multiple teams
  • Solid understanding of ML fundamentals including evaluation, experiment design, and model iteration in production environments

Nice to have

  • Experience shipping ML-powered features end-to-end, from design through production and iteration, with measurable impact on product or business metrics

What the JD emphasized

  • end-to-end ML platform
  • ML platform
  • training, experimentation, batch/online inference, edge
  • scalable online and batch inference systems
  • ML infrastructure architecture and strategy

Other signals

  • inference
  • edge deployment
  • ML infrastructure