Machine Learning Systems Engineer, Research Tools

Anthropic · AI Frontier · New York, NY +1 · AI Research & Engineering

A Machine Learning Systems Engineer role focused on developing and optimizing encodings and tokenization systems for Anthropic's Finetuning workflows. The role bridges the Pretraining and Finetuning teams, building infrastructure central to how models learn from and interpret data, and so directly affects research progress and efficiency.

What you'd actually do

  1. Design, develop, and maintain tokenization systems used across Pretraining and Finetuning workflows
  2. Optimize encoding techniques to improve model training efficiency and performance
  3. Collaborate closely with research teams to understand their evolving needs around data representation
  4. Build infrastructure that enables researchers to experiment with novel tokenization approaches
  5. Implement systems for monitoring and debugging tokenization-related issues in the model training pipeline

Skills

Required

  • Python
  • Machine learning systems
  • Data pipelines
  • ML infrastructure
  • Modern ML development practices
  • Analytical skills
  • Ability to evaluate the impact of engineering changes on research outcomes

Nice to have

  • Machine learning data processing pipelines
  • Data encodings for ML applications
  • BPE, WordPiece, or other tokenization algorithms
  • Performance optimization of ML data processing systems
  • Multi-language tokenization challenges and solutions
  • Research environments where engineering directly enables scientific progress
  • Distributed systems and parallel computing for ML workflows
  • Large language models or other transformer-based architectures

What the JD emphasized

  • significant software engineering experience with demonstrated machine learning expertise
  • machine learning systems, data pipelines, or ML infrastructure
  • Python
  • modern ML development practices
  • analytical skills
  • evaluate the impact of engineering changes on research outcomes
  • machine learning data processing pipelines
  • data encodings for ML applications
  • tokenization algorithms
  • Performance optimization of ML data processing systems
  • Multi-language tokenization challenges and solutions
  • Research environments where engineering directly enables scientific progress
  • Large language models or other transformer-based architectures

Other signals

  • Develop and optimize encodings and tokenization systems for Finetuning workflows
  • Bridge between Pretraining and Finetuning teams
  • Build critical infrastructure impacting model learning and data interpretation