Manager, Software Engineering, ML Inference

Snap · Consumer · Palo Alto, CA +3

Manages a team of ML Infrastructure engineers who build and scale the systems for model training, inference, and data pipelines. The role involves setting strategy, creating roadmaps, mentoring engineers, and collaborating with product teams to deliver solutions at scale, with an emphasis on availability, scalability, operational excellence, and cost management.

What you'd actually do

  1. Lead and mentor a team of ML Infrastructure engineers responsible for building and scaling the systems that power Snap's model training, inference, and data pipelines
  2. Set the strategy, build a roadmap, create measurable goals, and lead your team to deliver high-impact ML infrastructure initiatives
  3. Evaluate the technical tradeoffs of key decisions and serve as a strong technical mentor across the team
  4. Perform design and code reviews to continuously raise the technical excellence bar
  5. Collaborate with ML engineers, product teams, and cross-functional stakeholders to understand requirements, evaluate tradeoffs, and deliver solutions at scale

Skills

Required

  • Strong understanding of ML infrastructure systems including model training platforms, inference serving, feature stores, and data pipelines
  • Background building high-availability, mission-critical systems at significant scale
  • Experience setting technical direction for teams whose work directly enables ML engineers and production ML systems
  • Strong management and mentorship skills, with the ability to lead and grow senior engineers
  • Excellent verbal and written communication skills, with high attention to detail
  • Ability to collaborate with internal and external stakeholders at all levels
  • Skilled at managing ambiguous problems and driving clarity across complex, multi-team initiatives
  • Proficiency in, or a strong aptitude for, leveraging AI tools to streamline development, paired with the critical judgment to audit generated output for architectural integrity, performance bottlenecks, and security risks
  • Adaptability in learning and applying evolving AI systems and tools to remain at the forefront of engineering trends and modern development practices
  • Bachelor's degree in a technical field such as computer science or equivalent years of experience
  • 9+ years of post-Bachelor's software engineering experience; or a Master's degree in a technical field + 8+ years of post-grad experience; or a PhD in a related technical field + 5+ years of post-grad experience
  • 1+ year(s) of experience managing an engineering team
  • Experience with distributed systems and large-scale ML infrastructure

Nice to have

  • Advanced degree in a related technical field
  • Experience working with ML training platforms, inference infrastructure, or feature serving systems
  • Familiarity with ML frameworks such as TensorFlow, PyTorch, Caffe2, Spark ML, or related frameworks
  • Experience with Spark, Flink, Ray, or other big data processing technologies
  • Experience with key infrastructure technologies including Kubernetes, NoSQL, Memcache/Redis, Kafka, Google Cloud, or AWS services
  • Track record of delivery in rapidly changing, highly collaborative, multi-stakeholder environments
  • Experience with MLOps and managing the production machine learning lifecycle

What the JD emphasized

  • responsible for building and scaling the systems
  • deliver high-impact ML infrastructure initiatives
  • deliver solutions at scale
  • high-availability, mission-critical systems at significant scale
  • setting technical direction for teams whose work directly enables ML engineers and production ML systems

Other signals

  • ML infrastructure engineers
  • building and scaling systems
  • model training, inference, and data pipelines