Engineering Manager, ML Infrastructure Control Plane

Google · Big Tech · Sunnyvale, CA +1

Google is seeking an Engineering Manager for its ML Infrastructure Control Plane team. The role leads engineering efforts for Alphabet's ML workloads, supporting critical missions across various Google divisions. The team focuses on providing efficient, reliable, and easy-to-use fleet-wide scheduling for ML infrastructure. The manager will set team priorities, develop technical roadmaps, guide system designs, and oversee code development and reviews to keep large-scale ML infrastructure operating efficiently.

What you'd actually do

  1. Set and communicate team priorities that support the broader organization's goals. Align strategy, processes, and decision-making across teams.
  2. Set clear expectations with individuals based on their level and role and aligned to the broader organization's goals. Meet regularly with individuals to discuss performance and development and provide feedback and coaching.
  3. Develop the mid-term technical goal and roadmap within the scope of your teams. Evolve the roadmap to meet anticipated future requirements and infrastructure needs.
  4. Design, guide and vet systems designs within the scope of the broader area, and write product or system development code to solve ambiguous problems.
  5. Review code developed by other engineers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency).

Skills

Required

  • software development
  • developing large-scale infrastructure, distributed systems or networks, or experience with compute technologies, storage or hardware architecture
  • technical leadership role
  • people management or team leadership role

Nice to have

  • Master's degree or PhD in Computer Science or related technical field
  • working in a complex, matrixed organization
  • infrastructure/ML
  • Understanding of the end-to-end ML development lifecycle

What the JD emphasized

  • ML infrastructure
  • fleet-wide scheduling
  • Alphabet's ML workloads
  • end-to-end ML development lifecycle

Other signals

  • DeepMind, AdBrain, Search, Waymo, and Cloud
  • hyperscale computing
  • Vertex AI