Member of Technical Staff - Pretraining Text Data

Microsoft · Big Tech · Mountain View, CA +4 · Software Engineering

Seeking engineers and researchers to join the Pretraining Text Data team to build the next generation of foundation large language models. The role involves designing and curating high-quality text datasets, developing novel data collection strategies, improving dataset quality, understanding data-driven model behaviors, and aligning datasets with ethical values. Responsibilities include creating datasets for training and evaluation, developing scalable data pipelines, analyzing datasets, building tools for auditing, and collaborating with safety and ethics teams.

What you'd actually do

  1. Create high-quality datasets for training and evaluation; run experiments on new datasets (data ablations) to assess their impact and determine the most effective data.
  2. Develop and maintain scalable data pipelines for text data ingestion, preprocessing, filtering, and annotation.
  3. Analyze real-world text datasets to assess quality, diversity, and relevance, and to identify areas for improvement.
  4. Build lightweight tools and workflows for dataset auditing, visualization, and versioning.
  5. Collaborate with Safety, Ethics, and Governance teams to ensure datasets meet standards for quality, privacy, and responsible AI practices.

Skills

Required

  • Bachelor's Degree in AI, Computer Science, Data Science, Statistics, Physics, Engineering, or related technical discipline
  • 4+ years technical engineering experience
  • Experience coding in languages including, but not limited to, Python
  • Experience with common data libraries (Pandas, NumPy, etc.)

Nice to have

  • Master's Degree in AI, Computer Science, Data Science, Statistics, Physics, Engineering, or related technical discipline AND 8+ years technical engineering experience
  • Bachelor's Degree in AI, Computer Science, Data Science, Statistics, Physics, Engineering, or related technical discipline AND 12+ years technical engineering experience
  • 2+ years of experience in data analysis or data engineering
  • Experience working with large-scale datasets that are unstructured or semi-structured
  • Proficiency in statistics and exploratory data analysis methods
  • Familiarity with data processing frameworks such as Spark, Ray, or Apache Beam
  • Ability to communicate technical findings clearly to research and product teams

What the JD emphasized

  • high-quality datasets
  • frontier AI models
  • novel data collection strategies
  • dataset quality and integrity
  • data-driven model behaviors
  • impact of data and data mixes
  • ethical and societal values
  • push the boundaries of what AI can learn from data
  • training and evaluation
  • scalable data pipelines
  • responsible AI practices

Other signals

  • foundation large language models
  • frontier AI models
  • curate, analyze, and evaluate diverse text datasets
  • develop novel data collection strategies
  • improve dataset quality and integrity
  • understand data-driven model behaviors
  • train models to understand the impact of data and data mixes
  • align datasets with ethical and societal values
  • push the boundaries of what AI can learn from data
  • create high-quality datasets for training and evaluation
  • run experiments on new datasets (data ablations) to assess their impact
  • develop and maintain scalable data pipelines for text data ingestion, preprocessing, filtering, and annotation
  • analyze real-world text datasets to assess quality, diversity, and relevance, and to identify areas for improvement
  • build lightweight tools and workflows for dataset auditing, visualization, and versioning
  • collaborate with Safety, Ethics, and Governance teams to ensure datasets meet standards for quality, privacy, and responsible AI practices