Language Engineer, Artificial General Intelligence - Data Services

Amazon · Big Tech · Cambridge, MA, United Kingdom · Data Science

The Language Engineer will develop diverse datasets for training and evaluating Amazon AI models, using synthetic data generation, model-supported generation, and human-in-the-loop methods. This role involves designing data collections, analyzing data, building tools for data creation, and collaborating with scientists and engineers to evaluate AI model performance. Experience with speech, text, and multimodal data is required.

What you'd actually do

  1. Design complex data collections with human participants in response to science needs: author instructions, define and implement quality targets and mechanisms, provide day-to-day coordination of data collection efforts (including planning, scheduling, and reporting), and be responsible for the final deliverables
  2. Design and conduct complex data creation tasks using synthetic and model-based data generation methods, following state-of-the-art approaches
  3. Analyze and extract insights from large amounts of data
  4. Build tools or tool prototypes for data analysis or data creation, using Python or another scripting language
  5. Use modeling tools to bootstrap or test new AI functionalities

Skills

Required

  • Master's or higher degree in a relevant field (Computational Linguistics or equivalent field with computational analysis)
  • 2+ years of experience in computational linguistics, language data processing, or AI data creation
  • Experience with language data annotation systems and other forms of data markup
  • Proficient with scripting languages, such as Python
  • Experience working with speech, text, and multimodal data in multiple languages
  • Excellent communication, strong organizational skills, and close attention to detail
  • Comfortable working in a fast-paced, highly collaborative, dynamic work environment

Nice to have

  • PhD in Computational Linguistics (or equivalent field with computational emphasis)
  • Expertise in bootstrapping AI data collections for quickly evolving requirements
  • Extensive experience working with speech, text, and multimodal data in multiple languages
  • Experience in data creation for complex agentic workflows
  • Practical experience with Machine Learning
  • Familiarity with technical concepts such as APIs
  • Practical knowledge of version control and agile development
  • Familiarity with database queries and data analysis processes (SQL, R, MATLAB, etc.)
  • Willingness to support several projects at one time, and to accept reprioritization as necessary
  • Ability to think creatively, with strong analytical and problem-solving skills

What the JD emphasized

  • complex, multimodal datasets
  • synthetic data generation
  • model-supported data generation
  • human-in-the-loop data collections
  • state-of-the-art approaches
  • speech, text, and multimodal data in multiple languages
  • data creation for complex agentic workflows

Other signals

  • develops diverse datasets to train and evaluate AI models
  • synthetic data generation
  • model-supported data generation
  • human-in-the-loop data collections