Senior ML Engineer, Computer Vision - Applied AI

Uber · Consumer · San Francisco, CA +1 · Engineering

A Senior ML Engineer role at Uber focused on computer vision and multimodal models for document understanding and transcription. The role involves developing, training, and deploying models into production systems, optimizing them for performance and efficiency, and analyzing production data for continuous improvement. It requires a strong foundation in deep learning, Python, and ML frameworks, along with experience deploying models and working with real-world datasets.

What you'd actually do

  1. Develop and train state-of-the-art computer vision and multimodal models (e.g., Vision-Language Models such as Gemini or similar foundation models) to transcribe and understand diverse document types including identity documents, receipts, invoices, and restaurant menus.
  2. Design and implement scalable vision systems, combining on-device and server-side models to balance latency, accuracy, privacy, and cost efficiency.
  3. Collaborate closely with ML Infrastructure and Earner/Product teams to define data requirements, labeling strategies, evaluation metrics, and integration pathways into the broader ML lifecycle.
  4. Own the full system lifecycle, from advanced model development and experimentation to production deployment, monitoring, and scaling for high-throughput applications.
  5. Build and maintain robust evaluation frameworks to measure transcription accuracy, document understanding performance, and model robustness across diverse geographies and document formats.

Skills

Required

  • Python
  • PyTorch
  • JAX
  • TensorFlow
  • computer vision
  • multimodal systems
  • deep learning fundamentals
  • model training
  • model evaluation
  • model debugging
  • production deployment
  • real-world datasets
  • problem-solving
  • cross-functional collaboration

Nice to have

  • robotics
  • embodied AI
  • large-scale vision models
  • foundation models
  • object detection
  • segmentation
  • OCR
  • document layout understanding
  • point cloud processing
  • edge devices
  • mobile platforms
  • model compression
  • quantization
  • pruning
  • TensorFlow Lite
  • ONNX
  • large-scale document datasets
  • data curation
  • data augmentation
  • distributed training systems
  • scalable ML infrastructure

What the JD emphasized

  • 5+ years of hands-on experience in machine learning, with a strong focus on computer vision or multimodal systems.
  • Experience deploying ML models into production systems and working with real-world datasets.
  • 5+ years of hands-on ML experience, preferably in robotics, computer vision, or embodied AI.
  • Strong experience training and optimizing large-scale vision or multimodal models, including Vision-Language Models (VLMs) or foundation models.

Other signals

  • develop and deploy state-of-the-art vision and multimodal models
  • integrate vision and multimodal models into production systems that operate reliably at large scale
  • Own the full system lifecycle, from advanced model development and experimentation to production deployment, monitoring, and scaling