SRE Manager, ML Operations

Apple · Big Tech · New York, NY +1 · Software and Services

This role is for an SRE Manager focused on ML Operations for Apple's Ad Serving infrastructure. The manager will lead and scale SRE teams responsible for the reliability, performance, and availability of ML Platforms and Services, ensuring operational excellence and continuous improvement. The role requires deep technical expertise in large-scale distributed systems and SRE principles, with a focus on managing and optimizing ML systems at scale.

What you'd actually do

  1. Lead and scale SRE teams responsible for the reliability, performance, and availability of ML Platforms and Services
  2. Grow and develop your engineers through mentorship, clear goal-setting, and meaningful career development
  3. Define a compelling team vision and drive execution toward high-quality, measurable outcomes
  4. Champion reliability engineering best practices including SLOs/SLAs, error budgets, observability, incident management, and fault analysis
  5. Foster a culture of engineering excellence -- encouraging innovation, knowledge sharing, and continuous improvement

Skills

Required

  • 10+ years of experience with large-scale distributed systems
  • 5+ years of experience in an engineering leadership role, ideally managing SRE or Production Engineering teams
  • Proven track record of building and leading high-performing engineering teams
  • Strong grasp of core operating system principles, networking fundamentals, and systems management
  • Deep understanding of SRE principles: monitoring, alerting, error budgets, fault analysis, capacity planning, and incident response
  • Excellent problem-solving, communication, and decision-making skills

Nice to have

  • Bachelor's or Master's degree in Computer Science or a related field
  • Experience managing and optimizing GPU-based clusters in production environments
  • Experience building and operating large-scale ML systems or ML infrastructure
  • Hands-on experience managing cloud infrastructure, particularly AWS
  • Familiarity with the digital advertising ecosystem and its technical demands
  • Demonstrated ability to influence and partner across Product, Data Science, and Platform Engineering organizations

What the JD emphasized

  • ML Operations
  • ML Platforms and Services
  • large-scale distributed systems
  • SRE principles
  • ML systems

Other signals

  • Ad Serving infrastructure