Helix AI Engineer, Video Pretraining

Figure AI · Robotics · HQ · AI - Helix Team

Figure AI is seeking a Helix AI Engineer focused on Video Pretraining to lead the development of large-scale video foundation models. The role involves designing and training models on diverse datasets, developing pretraining strategies that capture temporal dynamics, and building models that learn transferable representations for downstream tasks in perception, prediction, and embodied reasoning. The engineer will also optimize training performance, implement efficient data pipelines, design evaluation frameworks, and collaborate closely with other AI teams.

What you'd actually do

  1. Design and train large-scale video foundation models on diverse datasets spanning internet-scale video and robot-collected data
  2. Develop pretraining strategies that capture temporal dynamics, motion, and object interaction from raw video sequences
  3. Build models that learn transferable representations for downstream tasks such as perception, tracking, prediction, and control
  4. Explore architectures for video understanding and generation, including transformer-based and diffusion-based approaches
  5. Implement efficient data pipelines and training strategies for high-throughput video ingestion and large-scale distributed training

Skills

Required

  • Experience training large-scale models on video data or other high-dimensional sequential modalities
  • Strong understanding of modern deep learning architectures for video, vision, or multimodal systems
  • Experience with large-scale pretraining, including dataset curation, training dynamics, and scaling laws
  • Proficiency in Python and deep learning frameworks such as PyTorch
  • Experience working with distributed training systems and large GPU clusters
  • Strong experimental rigor and ability to iterate quickly on model design and training strategies
  • Solid software engineering skills and ability to build scalable, reliable systems
  • Ability to operate independently and drive ambiguous, high-impact research directions

Nice to have

  • Experience working on frontier video models or multimodal foundation models
  • Background in video diffusion, autoregressive video modeling, or world models
  • Experience at leading AI labs such as OpenAI, Google DeepMind, Google, ByteDance, Midjourney, or Adobe
  • Experience with large-scale dataset construction and filtering for video pretraining
  • Familiarity with robotics, embodied AI, or learning from egocentric/first-person video
  • Publication record in machine learning, computer vision, or multimodal AI

What the JD emphasized

  • large-scale video foundation models
  • pretraining strategies
  • transferable representations
  • large-scale pretraining
  • large GPU clusters
  • large-scale dataset construction

Other signals

  • distributed training