Research Scientist, Data

Pika Labs · AI Frontier · Palo Alto, CA · Research

Pika Labs is seeking a Research Engineer, Data to architect and scale data engineering systems for its multimodal foundation models. The role involves building, optimizing, and owning large-scale data pipelines for data curation, cleaning, labeling, filtering, augmentation, and storage to support model training and research workflows across text, image, audio, and video.

What you'd actually do

  1. Take ownership of large-scale data pipeline architecture and implementation to support model training and research workflows for text, image, audio, and video datasets
  2. Partner with research and engineering teams to curate, clean, and manage diverse, sensory-rich datasets for pre-training and mid-training of multimodal models
  3. Develop strategies and tools for scalable data ingestion, labeling, filtering, augmentation, and storage
  4. Ensure data quality, reliability, and compliance, including managing privacy and ethical considerations throughout the data lifecycle
  5. Optimize data processing, transformation, and delivery for large-scale distributed training pipelines

Skills

Required

  • 5+ years of experience building and scaling data pipelines for machine learning applications
  • Strong background in data engineering and ML data curation for LLMs, VLMs, or other large-scale multimodal models
  • Expertise in distributed data systems (e.g., Spark, Hadoop, Ray, or similar) and efficient large dataset processing/ETL workflows
  • Proven ability to build robust, scalable, and production-grade data infrastructure for ML pipelines
  • Experience developing tools for data labeling, filtering, deduplication, quality assurance, and dataset management
  • Strong programming skills (Python, SQL, PySpark, or similar)
  • Familiarity with cloud data platforms (AWS, GCP, Azure)
  • Knowledge of privacy, compliance, ethics, and best practices in data collection and management
  • Excellent cross-functional collaboration, problem-solving, and communication skills

Nice to have

  • Passion for powerful data infrastructure and innovative research-engineering
  • Passion for enabling cutting-edge generative AI and creative technology through data excellence

What the JD emphasized

  • large-scale data pipeline architecture
  • ML data curation
  • pre-training
  • multimodal models
  • data quality, reliability, and compliance
  • privacy and ethical considerations
  • large-scale distributed training pipelines
  • dataset creation, management
  • production-ready systems
  • data engineering
  • ML data management

Other signals

  • multimodal foundation models
  • large-scale data pipelines
  • ML data curation
  • real-time generation
  • intelligent agentic platforms