Research Engineer, Clio

Anthropic · AI Frontier · AI Research & Engineering

Machine Learning Systems Engineer on the Encodings and Tokenization team, developing and optimizing tokenization systems for Pretraining and Finetuning workflows. The role builds infrastructure that shapes how models learn from and interpret data, and serves as a bridge between the Pretraining and Finetuning teams.

What you'd actually do

  1. Design, develop, and maintain tokenization systems used across Pretraining and Finetuning workflows
  2. Optimize encoding techniques to improve model training efficiency and performance
  3. Collaborate closely with research teams to understand their evolving needs around data representation
  4. Build infrastructure that enables researchers to experiment with novel tokenization approaches
  5. Implement systems for monitoring and debugging tokenization-related issues in the model training pipeline

Skills

Required

  • Python
  • machine learning systems
  • data pipelines
  • ML infrastructure
  • analytical skills

Nice to have

  • machine learning data processing pipelines
  • data encodings for ML applications
  • BPE, WordPiece, or other tokenization algorithms
  • performance optimization of ML data processing systems
  • multi-language tokenization challenges and solutions
  • distributed systems and parallel computing for ML workflows
  • large language models or other transformer-based architectures

What the JD emphasized

  • 8+ years of software engineering experience
  • significant software engineering experience with demonstrated machine learning expertise

Other signals

  • Develop and optimize encodings and tokenization systems
  • Act as a bridge between the Pretraining and Finetuning teams
  • Build critical infrastructure that directly impacts how Anthropic's models learn from and interpret data