Software Engineer, Data Ingestion

Anthropic · AI Frontier · AI Research & Engineering

We are seeking a Software Engineer to join the Data Acquisition team, which owns the problem of acquiring all openly available data on the internet through a large-scale web crawler and through data partnerships. The role involves developing and maintaining internet-scale web crawlers, building data ingestion pipelines, and improving system observability. This work is crucial for producing the best pre-trained models.

What you'd actually do

  1. Develop and maintain our large-scale web crawler
  2. Build pipelines for data ingestion, analysis, and quality improvement
  3. Build specialized crawlers for high-value data sources
  4. Build tools for improving the observability and debuggability of the crawler system
  5. Collaborate with team members on improving data acquisition processes

Skills

Required

  • Python
  • Building and running large distributed systems
  • Cloud-based compute and storage solutions

Nice to have

  • Familiarity with the non-technical tradeoffs of internet-scale crawling (data privacy, robots.txt adherence, etc.)
  • Technical breadth: quickly understanding systems-design tradeoffs and keeping track of rapidly evolving software systems
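The robots.txt adherence mentioned above can be sketched with Python's standard-library parser; the user agent and the robots.txt body here are illustrative assumptions, not taken from the posting:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler user-agent token, for illustration only.
USER_AGENT = "example-crawler"

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body (normally fetched from <site>/robots.txt)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Illustrative rules: disallow /private/ for all user agents.
rules = build_parser("User-agent: *\nDisallow: /private/\n")

print(rules.can_fetch(USER_AGENT, "https://example.com/public/page"))   # True
print(rules.can_fetch(USER_AGENT, "https://example.com/private/page"))  # False
```

A production crawler would fetch each site's live robots.txt, cache it, and respect crawl-delay and sitemap directives as well; this sketch only shows the allow/deny check.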

What the JD emphasized

  • internet-scale web crawler
  • large-scale system to acquire all openly accessible information on the internet
  • building and running large distributed systems
  • internet-scale crawling

Other signals

  • The team’s responsibilities: develop and maintain an internet-scale web crawler responsible for crawling accessible internet data
  • Build the pipelines required to quickly ingest data from potential partners (for data-quality assessments) or from other sources
  • Successfully scaling our data corpus is critical to our continued efforts at producing the best pre-trained models.