ML Systems Engineer, Robotics

Scale AI · Data AI · San Francisco, CA · AVCV / Robotics EPD

This ML Systems Engineer role focuses on building and scaling serving platforms for robotics-related foundation models, optimizing algorithms for cloud GPUs, and developing internal platforms for model capability discovery. It involves backend system design, ML infrastructure, and low-latency serving for real-time applications.

What you'd actually do

  1. Build & Scale: Maintain fault-tolerant, high-performance systems for serving robotics-related models and foundation models at scale, ensuring low latency for real-time applications.
  2. Platform Development: Build an internal platform to empower model capability discovery, enabling faster iteration cycles for research teams working on robotics.
  3. Collaborate: Work closely with Robotics researchers and Computer Vision engineers to integrate and optimize models for production and research environments.
  4. Design Excellence: Conduct architecture and design reviews to uphold best practices in system scalability, reliability, and security.
  5. Observability: Develop monitoring and observability solutions that track system health and the real-time performance of model inference.

Skills

Required

  • building large-scale, high-performance backend systems
  • machine learning infrastructure
  • optimizing computer vision and other machine learning algorithms for cloud environments
  • GPU-level algorithm optimizations (e.g., CUDA, kernel tuning)
  • Python
  • Go
  • Rust
  • C++
  • serving and routing fundamentals (e.g., rate limiting, load balancing, compute budgets, concurrency)
  • containers (Docker)
  • orchestration (Kubernetes)
  • cloud providers (AWS/GCP)
  • infrastructure as code (e.g., Terraform)

Nice to have

  • Vision-Language-Action (VLA) models
  • high-performance video processing (e.g., FFmpeg, NVDEC/NVENC)
  • 3D data handling (point clouds)
  • robotics middleware (e.g., ROS/ROS2)
  • AV data formats

What the JD emphasized

  • ML Systems Engineer
  • physical agents
  • physical AI
  • robotics
  • foundation models
  • serving
  • backend system design
  • ML fundamentals
  • ML infrastructure
  • algorithm optimization
  • GPU-level algorithm optimizations
  • systems-level languages
  • serving and routing fundamentals
  • data-intensive applications
  • containers
  • orchestration
  • cloud providers
  • infrastructure as code

Other signals

  • ML pipelines for processing, training, and fine-tuning
  • optimizing algorithms and pipelines to run efficiently on GPUs
  • scalable, reliable, and efficient serving of foundation models