Data Engineer II - GenAI

Booking.com · Hospitality · Tel Aviv, Israel · Data Engineering

Data Engineer II - GenAI role at Booking.com focused on building and optimizing data pipelines for training content models, including GenAI foundation models and supervised fine-tuning. The role involves working with petabyte-scale datasets from various sources to ensure high-quality data for ML platforms and downstream applications, in collaboration with data scientists and engineers.

What you'd actually do

  1. Rapidly developing next-generation scalable, flexible, and high-performance data pipelines.
  2. Dealing with massive textual sources to train GenAI foundation models.
  3. Solving issues with data and data pipelines, prioritizing based on customer impact.
  4. Owning data quality end-to-end across our core datasets and data pipelines.
  5. Experimenting with new tools and technologies to meet business requirements regarding performance, scaling, and data quality.

Skills

Required

  • Python
  • Java
  • PySpark
  • Apache Flink
  • Snowflake
  • MySQL
  • Cassandra
  • DynamoDB
  • Data warehousing
  • ETL/ELT pipelines
  • Production data pipelines
  • Data lakes
  • Serverless solutions
  • Schema design
  • Data modeling

Nice to have

  • Experience in data processing for large-scale language models such as GPT, BERT, or similar architectures
  • NumPy
  • pandas
  • Matplotlib
  • Experimental design
  • A/B testing
  • Evaluation metrics for ML models
  • Working on products that impact a large customer base

What the JD emphasized

  • Minimum of 3 years of experience as a Data Engineer or in a similar role, with a consistent record of successfully delivering ML/data solutions.
  • You have built production data pipelines in the cloud, setting up data lakes and serverless solutions; you have hands-on experience with schema design and data modeling, and with working alongside ML scientists and ML engineers to deliver production-level ML solutions.

Other signals

  • data pipelines
  • GenAI foundation models
  • ML platforms
  • petabytes of data
  • ML scientists