Senior Applied ML Scientist – Generative Video

Apple · Big Tech · Cupertino, CA · Software and Services

This role focuses on researching, designing, and training state-of-the-art generative video models, primarily diffusion-based, with applications for creative users. It involves exploring novel architectures, spatiotemporal modeling, and multi-modal conditioning, aiming for real-world product impact.

What you'd actually do

  1. Design and train state-of-the-art generative video models, primarily based on diffusion, consistency, rectified flow, or related generative frameworks
  2. Explore novel architectures for spatiotemporal modeling (e.g., 3D U-Nets, DiT-style Transformers, hybrid CNN-Transformer models)
  3. Conduct experiments on long-range temporal coherence, motion consistency, controllability, and multi-modal conditioning (text, audio, images)

Skills

Required

  • MS in Computer Science, Machine Learning, or a related field, or equivalent practical experience
  • deep learning for generative models
  • diffusion-based methods
  • distributed training of large models using PyTorch
  • video representations
  • spatiotemporal modeling
  • neural network optimization
  • multi-modal DiT-style Transformers
  • latent diffusion
  • multi-stage video generation pipelines
  • adapters

Nice to have

  • PhD with research focused on generative modeling, diffusion models, or video understanding
  • text-to-video generation
  • image-to-video generation
  • video-to-video generation
  • Publications in top-tier ML conferences

What the JD emphasized

  • 4+ years of experience in deep learning for generative models, particularly diffusion-based methods
  • Experience designing & training multi-modal DiT-style Transformers, latent diffusion, or multi-stage video generation pipelines

Other signals

  • generative video modeling
  • diffusion-based methods
  • state-of-the-art generative video models
  • novel architectures for spatiotemporal modeling
  • long-range temporal coherence
  • controllability
  • multi-modal conditioning