Senior Member of Technical Staff, Web Data

Cohere · AI Frontier · Toronto, ON · Modeling

Cohere is seeking a Senior Member of Technical Staff to develop large-scale web data pipelines for pre-training language models. This role involves transforming raw internet data into high-quality training data, owning data pipeline components, analyzing data composition and quality, and collaborating with research and evaluation teams.

What you'd actually do

  1. Maintain large-scale pipelines for processing web corpora.
  2. Work on filtering and quality-scoring systems to identify high-value web documents.
  3. Analyze web data composition across domains, languages and time periods.
  4. Develop and maintain high-performance deduplication pipelines.
  5. Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models.

Skills

Required

  • Strong software engineering skills
  • Python proficiency
  • building data pipelines
  • Apache Spark
  • Apache Beam
  • Pandas
  • large-scale web datasets
  • data quality assessment techniques
  • experimentation with data mixtures

Nice to have

  • papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)

What the JD emphasized

  • large-scale web data pipeline
  • pre-training
  • large-scale web corpora
  • high-quality training data
  • data pipeline
  • web data composition
  • model performance
  • training corpus

Other signals

  • training data
  • pretraining
  • large-scale web corpora
  • data pipeline