Machine Learning Systems Engineer, Encodings and Tokenization

Anthropic · AI Frontier · AI Research & Engineering

This Machine Learning Systems Engineer role focuses on developing and optimizing encodings and tokenization systems for Anthropic's Finetuning workflows and acts as a bridge between the Pretraining and Finetuning teams. The work targets three outcomes: improving model training efficiency and performance, enabling researchers to experiment with new tokenization methods, and ensuring the reliability and interpretability of AI systems.

What you'd actually do

  1. Design, develop, and maintain tokenization systems used across Pretraining and Finetuning workflows
  2. Optimize encoding techniques to improve model training efficiency and performance
  3. Collaborate closely with research teams to understand their evolving needs around data representation
  4. Build infrastructure that enables researchers to experiment with novel tokenization approaches
  5. Implement systems for monitoring and debugging tokenization-related issues in the model training pipeline

Skills

Required

  • Python
  • Machine learning systems
  • Data pipelines
  • ML infrastructure
  • Modern ML development practices
  • Analytical skills

Nice to have

  • Machine learning data processing pipelines
  • Data encodings for ML applications
  • BPE, WordPiece, or other tokenization algorithms
  • Performance optimization of ML data processing systems
  • Multi-language tokenization
  • Distributed systems
  • Parallel computing for ML workflows
  • Large language models
  • Transformer-based architectures

What the JD emphasized

  • 8+ years of software engineering experience
  • Significant software engineering experience with demonstrated machine learning expertise

Other signals

  • Develop and optimize tokenization systems for pretraining and finetuning workflows
  • Build infrastructure for researchers to experiment with novel tokenization approaches
  • Implement systems for monitoring and debugging tokenization-related issues