Staff Software Engineer, ML Infrastructure, Level 6

Snap · Consumer · Bellevue, WA +3

This Staff Software Engineer, ML Infrastructure role focuses on scaling ML infrastructure and optimizing embedding, feature, and training data storage/compute for massive-scale ML models. Responsibilities include designing and optimizing infrastructure systems, developing high-performance embedding generation/batch inference systems, and building data management systems for scalable data collection, labeling, processing, and evaluation. The role also involves integrating ML data quality systems and working with ML engineers to deploy models into production, while using AI tools for development.

What you'd actually do

  1. Design and optimize infrastructure systems for machine learning workloads at scale and drive reliability and efficiency improvements across Snapchat’s ML Infrastructure
  2. Develop high-performance embedding generation / batch inference systems to improve model performance
  3. Develop high-performance data storage/compute systems to improve the efficiency of our ML infrastructure
  4. Integrate a state-of-the-art ML data quality system to ensure model performance
  5. Build comprehensive data management systems for scalable data collection, labeling, processing, and evaluation

Skills

Required

  • Strong programming skills in Python, Java, Scala, or C++
  • Strong problem-solving skills with a focus on system performance, scalability, and efficiency
  • Good understanding of distributed systems and the infrastructure components of large-scale ML
  • Ability to collaborate and work well with others
  • Proven track record of operating highly-available systems at significant scale
  • Ability to proactively learn new concepts and apply them at work
  • Proficiency in, or a strong aptitude for, leveraging AI tools to streamline development, paired with the critical judgment to audit generated output for architectural integrity, performance bottlenecks, and security risks
  • Adaptability in learning and applying evolving AI systems and tools to remain at the forefront of engineering trends and modern development practices
  • Bachelor’s degree in a technical field such as computer science or equivalent experience
  • 9+ years of post-Bachelor’s software development experience; or Master’s degree in a technical field + 5+ years of post-grad software development experience; or PhD in a relevant technical field + 2+ years of post-grad software development experience
  • Experience building large-scale production machine learning systems, distributed systems, or big data processing

Nice to have

  • Master’s/PhD in a technical field such as computer science or equivalent industry experience
  • Experience with big data processing frameworks such as Spark, Flink, or Ray
  • Experience with a large-scale feature store or embedding system
  • Familiarity with ML frameworks such as PyTorch or TensorFlow

What the JD emphasized

  • operating highly-available systems at significant scale
  • Experience building large-scale production machine learning systems, distributed systems, or big data processing

Other signals

  • ML Infrastructure
  • embedding generation
  • batch inference systems
  • data storage/compute systems
  • ML data quality system
  • data management systems
  • deploying models into production
  • AI tools