Research Engineer, Multimodal

Character AI · AI Frontier · Redwood City, CA · Technical Staff - ML

Research Engineer role focused on multimodal AI, specifically video and image generation models. The work covers training, fine-tuning, and deploying these models, including joint audio-visual generation and image-to-video, as well as designing new architectures, optimizing models for inference, and building large-scale data pipelines. The company is a consumer-facing AI platform.

What you'd actually do

  1. Lead fine-tuning and continued training of video generation models, including image-to-video and joint audio-visual generation.
  2. Design and experiment with novel model architectures for multimodal generation, including multimodal conditioning (voice, structured text, reference images).
  3. Leverage techniques such as LoRA, RLHF, and full-parameter fine-tuning to improve model quality across diverse visual scenarios.
  4. Design and build large-scale data pipelines and automated annotation workflows to support continuous model improvement.
  5. Explore model compression, inference acceleration, and serving optimizations to enable efficient real-time video processing at scale.

Skills

Required

  • PyTorch
  • video generation architectures
  • image generation architectures
  • diffusion models
  • DiT
  • ControlNet
  • multimodal model training
  • distributed training tools
  • large-scale data processing
  • dataset construction
  • automated data cleaning

Nice to have

  • joint audio-visual generation
  • speech-conditioned generation models
  • AIGC
  • video effects
  • character animation
  • asset generation products
  • ML deployment
  • orchestration
  • Kubernetes
  • Slurm
  • Docker
  • cloud platforms
  • publications in relevant venues

What the JD emphasized

  • video generation models
  • image generation models
  • audio generation models
  • multimodal model training
  • large-scale data processing
  • real-time video processing

Other signals

  • video generation models
  • image generation models
  • audio generation models
  • multimodal generation
  • large-scale data pipelines
  • real-time video processing