Member of Technical Staff, Pre-training Data

Cohere · AI Frontier · Toronto, ON · Modeling

Cohere is seeking a Machine Learning Engineer specializing in pre-training data to develop data pipelines for its advanced language models. The role involves conducting data ablations, evaluating data quality, and constructing pre-training data mixtures to enhance model performance, contributing directly to improvements in training metrics such as throughput and accelerator utilization.

What you'd actually do

  1. Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance.
  2. Develop robust data modeling techniques to ensure datasets are structured and formatted for optimal training efficiency.
  3. Research and implement innovative data curation methods, leveraging Cohere’s infrastructure to drive advancements in natural language processing.
  4. Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models.

Skills

Required

  • Python
  • data pipelines
  • curriculum learning
  • data mixing
  • data attribution
  • data processing frameworks (Apache Spark, Apache Beam, Pandas, or similar)
  • large-scale datasets
  • web data
  • code data
  • multilingual corpora
  • data quality assessment techniques
  • experimentation with data mixtures

Nice to have

  • papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)

What the JD emphasized

  • pre-training data
  • data ablations
  • data mixtures
  • model performance
  • training efficiency
  • data quality
  • large-scale datasets
  • web data
  • code data
  • multilingual corpora

Other signals

  • pre-training data
  • data mixtures
  • model performance
  • data quality
  • training efficiency