Web Crawling Engineer

Mistral AI · AI Frontier · Paris, France · Engineering & Infra

This role focuses on developing and maintaining web crawlers in Go, using headless browsing and distributed job queues to extract and process large-scale data from diverse web sources. It involves collaborating with cross-functional teams and ensuring data quality and integrity.

What you'd actually do

  1. Develop and maintain web crawlers in Go to extract data from target websites.
  2. Use headless browsing techniques, such as the Chrome DevTools Protocol, to automate and optimize data collection.
  3. Collaborate with cross-functional teams to identify, scrape, and integrate data from APIs and web pages to support business objectives.
  4. Create and implement efficient parsing patterns using tokenizers, regular expressions, XPaths, and CSS selectors to ensure accurate data extraction.
  5. Design and manage distributed job queues using technologies such as Redis, Aerospike, and Kubernetes to handle large-scale crawling and processing tasks.

Skills

Required

  • Go (Golang)/Rust/Zig
  • TCP, UDP, TLS, and HTTP (1.1, 2, and 3) protocols
  • HTML, CSS, and JavaScript
  • cloud platforms (AWS, GCP)
  • orchestration (Kubernetes, Nomad)
  • containerization (Docker)
  • queues, stacks, hash maps
  • networking and web scraping libraries
  • SQL and/or NoSQL databases
  • distributed systems (e.g., Hadoop, Spark)

Nice to have

  • Aerospike
  • web archiving projects & tooling
  • Machine Learning to improve crawling efficiency or accuracy
  • low-level networking programming and/or userspace TCP/IP stacks