Member of Technical Staff, Synthetic Data

Cohere · AI Frontier · Toronto, ON · Modeling

Cohere is seeking a Machine Learning Engineer specializing in synthetic data to develop and manage the synthetic data pipeline for its advanced language models. The role covers synthetic data end to end: pipeline optimization, data analysis and generation, and model evaluation. The engineer will work with diverse web and code data, transforming it with generative models to improve token efficiency and model quality, and will bridge research and engineering to improve training metrics such as throughput and accelerator utilization.

What you'd actually do

  1. Design and build scalable inference pipelines that run on large GPU clusters.
  2. Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance.
  3. Research and implement innovative synthetic data curation methods, leveraging Cohere’s infrastructure to drive advancements in natural language processing.
  4. Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models.

Skills

Required

  • Strong software engineering skills
  • Proficiency in Python
  • Experience building data pipelines
  • Experience working with LLMs
  • Experience working with large-scale datasets, including web data, code data, and multilingual corpora

Nice to have

  • Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools.
  • Familiarity with LLM inference frameworks such as vLLM and TensorRT.
  • A passion for bridging research and engineering to solve complex data-related challenges in AI model training.
  • Papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

What the JD emphasized

  • scalable inference pipelines
  • large GPU clusters
  • data ablations
  • model performance
  • synthetic data curation methods
  • advancements in natural language processing
  • cutting-edge language models

Other signals

  • synthetic data generation
  • data pipeline optimization
  • model evaluation