Member of Technical Staff - Pretraining Text Data

Microsoft · Big Tech · London, United Kingdom +2 · Software Engineering

Seeking engineers and researchers to join the Pretraining Text Data team to build the next generation of foundation large language models. The role focuses on designing and curating high-quality datasets, developing novel data collection strategies, improving dataset quality and integrity, understanding data-driven model behaviors, training models on data impact, and aligning datasets with ethical and societal values. This is a cross-disciplinary, high-impact role at the intersection of data and innovation.

What you'd actually do

  1. Create high-quality datasets for training and evaluation; run experiments on new datasets (data ablations) to assess their impact and determine the most effective data.
  2. Develop and maintain scalable data pipelines for text data ingestion, preprocessing, filtering, and annotation.
  3. Analyze real-world text datasets to assess quality, diversity, relevance, and identify areas for improvement.
  4. Build lightweight tools and workflows for dataset auditing, visualization, and versioning.
  5. Collaborate with Safety, Ethics, and Governance teams to ensure datasets meet standards for quality, privacy, and responsible AI practices.

Skills

Required

  • Bachelor's Degree in AI, Computer Science, Data Science, Statistics, Physics, Engineering, or related technical discipline
  • Technical engineering experience with coding in languages including, but not limited to, Python and common data libraries (Pandas, NumPy, etc.)

Nice to have

  • Master's Degree in AI, Computer Science, Data Science, Statistics, Physics, Engineering, or related technical discipline
  • 12+ years technical engineering experience with coding in languages including, but not limited to, Python and common data libraries (Pandas, NumPy, etc.)
  • 2+ years of experience in data analysis or data engineering, including work with large-scale datasets that are unstructured or semi-structured.
  • Proficiency in statistics and exploratory data analysis methods.

What the JD emphasized

  • high-quality datasets
  • frontier AI models
  • data-driven model behaviors
  • ethical and societal values
  • responsible AI practices

Other signals

  • curating high-quality datasets
  • powering frontier AI models
  • data-driven model behaviors
  • responsible AI practices