Software Engineer, Data

HeyGen · Multimodal · Los Angeles, CA +2 · Engineering

A Software Engineer role with data engineering responsibilities: building the foundational data layers for next-generation features, so that AI models can run in real time and deliver engaging user experiences. The work includes building and scaling data pipelines for multimedia data, supporting intelligent features, architecting data lakehouse solutions, ensuring data reliability, and productizing data for AI agents.

What you'd actually do

  1. Build & Scale Data Pipelines: Design, develop, and maintain robust batch and real-time data pipelines (using Python, Go, Spark, Kafka) that ingest and transform massive multi-modal data (text, audio, and video) to train and run AI models.
  2. Power Intelligent Features: Collaborate with ML engineers to implement data structures and APIs for new, exciting features like PPT-to-video automation and interactive AI avatars that require low-latency data fetching.
  3. Data Lakehouse Infrastructure: Architect and manage data lakehouse solutions (e.g., Snowflake, Databricks, Apache Iceberg) to store and query unstructured media data, improving storage and compute efficiency.
  4. Data Reliability & Observability: Implement data quality checks, data contracts, and monitoring to keep data reliable and prevent downtime in production video generation.
  5. Productize Data: Transform raw data into structured, actionable data products that can be easily consumed by front-end applications, API endpoints, and AI agents.

Skills

Required

  • Python
  • SQL
  • ETL
  • Data modeling
  • Cloud platforms (AWS/GCP)
  • Kafka
  • Spark
  • Snowflake/Databricks

Nice to have

  • Go
  • Computer Vision
  • Generative AI data processing

What the JD emphasized

  • massive multi-modal data
  • low-latency data fetching
  • AI models
  • AI agents

Other signals

  • data pipelines for AI models
  • multimedia data
  • AI avatars
  • data lakehouse