What you'd actually do

Develop and train state-of-the-art computer vision and multimodal models (e.g., Vision-Language Models such as Gemini or similar foundation models) to transcribe and understand diverse document types, including identity documents, receipts, invoices, and restaurant menus.

Design and implement scalable vision systems, combining on-device and server-side models to balance latency, accuracy, privacy, and cost efficiency.

Collaborate closely with ML Infrastructure and Earner/Product teams to define data requirements, labeling strategies, evaluation metrics, and integration pathways into the broader ML lifecycle.

Own the full system lifecycle, from advanced model development and experimentation to production deployment, monitoring, and scaling for high-throughput applications.

Build and maintain robust evaluation frameworks to measure transcription accuracy, document understanding performance, and model robustness across diverse geographies and document formats.

Skills

Required

Python
PyTorch
JAX
TensorFlow
computer vision
multimodal systems
deep learning fundamentals
model training
model evaluation
model debugging
production deployment
real-world datasets
problem-solving
cross-functional collaboration

Nice to have

robotics
embodied AI
large-scale vision models
foundation models
object detection
segmentation
OCR
document layout understanding
point cloud processing
edge devices
mobile platforms
model compression
quantization
pruning
TensorFlow Lite
ONNX
large-scale document datasets
data curation
data augmentation
distributed training systems
scalable ML infrastructure

What the JD emphasized

5+ years of hands-on experience in machine learning, with a strong focus on computer vision or multimodal systems.

Experience deploying ML models into production systems and working with real-world datasets.

5+ years of hands-on ML experience, preferably in robotics, computer vision, or embodied AI.

Strong experience training and optimizing large-scale vision or multimodal models, including Vision-Language Models (VLMs) or foundation models.

About the Role

Applied AI at Uber builds intelligent systems that power critical product experiences across the platform. As a Senior Machine Learning Engineer — Computer Vision, you will develop and deploy state-of-the-art vision and multimodal models that enable scalable document understanding and transcription systems across Uber’s ecosystem.

Your work will power high-impact applications such as earner onboarding verification, receipt transcription, restaurant menu digitization, and other document intelligence workflows. You will design, train, and optimize modern computer vision and vision-language models, integrating them into production systems that operate reliably at large scale.

This role combines deep model development expertise with production engineering rigor — ideal for someone who thrives at the intersection of research innovation and real-world deployment.

What the Candidate Will Do:

Develop and train state-of-the-art computer vision and multimodal models (e.g., Vision-Language Models such as Gemini or similar foundation models) to transcribe and understand diverse document types, including identity documents, receipts, invoices, and restaurant menus.
Design and implement scalable vision systems, combining on-device and server-side models to balance latency, accuracy, privacy, and cost efficiency.
Collaborate closely with ML Infrastructure and Earner/Product teams to define data requirements, labeling strategies, evaluation metrics, and integration pathways into the broader ML lifecycle.
Own the full system lifecycle, from advanced model development and experimentation to production deployment, monitoring, and scaling for high-throughput applications.
Build and maintain robust evaluation frameworks to measure transcription accuracy, document understanding performance, and model robustness across diverse geographies and document formats.
Optimize models for performance and efficiency, including model compression, quantization, and hardware-aware optimization for mobile or edge deployment when required.
Analyze production data and failure cases to continuously improve model quality, generalization, and system reliability.

Basic Qualifications:

5+ years of hands-on experience in machine learning, with a strong focus on computer vision or multimodal systems.
Solid foundation in deep learning fundamentals, including training, evaluation, and debugging of neural networks.
Proficiency in Python and modern ML frameworks such as PyTorch, JAX, or TensorFlow (including TensorFlow Lite).
Experience deploying ML models into production systems and working with real-world datasets.
Strong problem-solving skills and ability to work cross-functionally in product-driven environments.

Preferred Qualifications:

5+ years of hands-on ML experience, preferably in robotics, computer vision, or embodied AI.
Strong experience training and optimizing large-scale vision or multimodal models, including Vision-Language Models (VLMs) or foundation models.
Deep understanding of computer vision techniques such as object detection, segmentation, OCR, document layout understanding, and point cloud processing.
Experience deploying models to edge devices or mobile platforms, including performance optimization (quantization, pruning, TensorFlow Lite, ONNX, etc.).
Experience working with large-scale document or visual datasets, including data curation and augmentation strategies.
Familiarity with distributed training systems and scalable ML infrastructure.

For San Francisco, CA-based roles: The base salary range for this role is USD $202,000 to USD $224,000 per year.

For Seattle, WA-based roles: The base salary range for this role is USD $202,000 to USD $224,000 per year.

For Sunnyvale, CA-based roles: The base salary range for this role is USD $202,000 to USD $224,000 per year.

For all US locations, you will be eligible to participate in Uber's bonus program, and may be offered an equity award and other types of compensation. All full-time employees are eligible to participate in a 401(k) plan. You will also be eligible for various benefits. More details can be found at https://jobs.uber.com/en/benefits.

Uber's mission is to reimagine the way the world moves for the better. Here, bold ideas create real-world impact, challenges drive growth, and speed fuels progress. What moves us, moves the world - let's move it forward, together.

Uber is proud to be an Equal Opportunity employer. All qualified applicants will receive consideration for employment without regard to sex, gender identity, sexual orientation, race, color, religion, national origin, disability, protected Veteran status, age, or any other characteristic protected by law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you have a disability or special need that requires accommodation, please let us know by completing this form.

Offices continue to be central to collaboration and Uber's cultural identity. Unless formally approved to work fully remotely, Uber expects employees to spend at least half of their work time in their assigned office. For certain roles, such as those based at green-light hubs, employees are expected to be in-office for 100% of their time. Please speak with your recruiter to better understand in-office expectations for this role.