Senior Language Engineer, Artificial General Intelligence - Data Services

Amazon · Big Tech · Boston, MA · Data Science

This role focuses on developing diverse datasets for training and evaluating AI models, using synthetic data generation, model-based generation, and human-in-the-loop approaches. The Senior Language Engineer will define data creation strategies, lead complex data collections, and analyze large datasets. They will also build tools for data analysis and creation, and collaborate with scientists to evaluate AI model performance.

What you'd actually do

  1. Define and lead the organization's data creation strategies for our science partners
  2. Design and lead complex data collections with human participants in response to science needs: author instructions, define and implement quality targets and mechanisms, provide day-to-day coordination of data collection efforts (including planning, scheduling, and reporting), and be responsible for the final deliverables
  3. Design and conduct complex data creation tasks using synthetic and model-based data generation methods, following state-of-the-art approaches
  4. Analyze and extract insights from large amounts of data
  5. Build tools or tool prototypes for data analysis or data creation, using Python or another scripting language

Skills

Required

  • Master's degree in a relevant field (Computational Linguistics or equivalent field with computational analysis)
  • Python or another scripting language
  • Machine Learning training and evaluations

Nice to have

  • PhD in Computational Linguistics (or equivalent field with computational emphasis)
  • speech, text, and multimodal data
  • multiple languages
  • APIs
  • version control
  • agile development procedures
  • database queries
  • data analysis processes (SQL, R, MATLAB, etc.)

What the JD emphasized

  • 5+ years of experience creating AI datasets for complex and quickly evolving requirements using a range of approaches: model-based, human-in-the-loop, synthetic/code-based, etc.
  • 5+ years of experience working with speech, text, and multimodal data, including in multiple languages
  • 5+ years of experience defining and leading cross-team data creation strategies for long-term science customers
  • 5+ years of experience with Machine Learning training and evaluations, specifically regarding the types of data needed for different training types

Other signals

  • developing diverse datasets to train and evaluate AI models
  • synthetic data generation
  • model-based data generation
  • human-in-the-loop data collections
  • multimodal datasets