Google's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. Our products need to handle information at massive scale, and extend well beyond web search. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google’s needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full stack as we continue to push technology forward.
Platforms Infrastructure Engineering operates within the Google Cloud umbrella. We provide the AI/ML infrastructure on which Google runs - both internally and externally.
Large-scale ML training requires a huge infrastructure footprint, all of which is connected by an equally large and dense networking infrastructure. Join us to directly enable this next generation of Google's AI infrastructure. Our mission is to find innovative ways to increase availability, reduce risk to production traffic, and more efficiently operate the network that enables large-scale training and serving.
As the lead for this team, you will set the long-term technical roadmap to improve safety, increase observability, and improve automated remediation, ensuring that nearly all of Google's customers run with the maximum possible availability.
The AI and Infrastructure team is redefining what’s possible. We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and velocity. Our customers include Googlers, Google Cloud customers, and billions of Google users worldwide.
We're the driving force behind Google's groundbreaking innovations, empowering the development of our cutting-edge AI models, delivering unparalleled computing power to global services, and providing the essential platforms that enable developers to build the future. From software to hardware, our teams are shaping the future of world-leading hyperscale computing, with key teams working on the development of our TPUs, Vertex AI for Google Cloud, Google Global Networking, Data Center operations, systems research, and much more.
Individual pay is determined by factors including job-related skills, experience, and relevant education or training.
US: $207,000 - $301,000 (USD) + 20% bonus target + equity + benefits
Learn more about benefits at Google.
Responsibilities
- Define the long-term goal for repair automation of AI/ML infrastructure, focusing on achieving goals through multiple parallel programs.
- Lead and participate in the design of agentic diagnostic systems that utilize Generative AI to automate diagnoses for next-gen networks.
- Work with platform teams to integrate new hardware platforms into the automation ecosystem, driving the qualification and repair workflows required for global fleet turn-up.
- Lead critical safety initiatives, such as automated anomaly detection, to protect fleet health and capacity.
- Mentor a team of junior and executive engineers and influence engineering practices across the broader infrastructure organization to drive consistency in automation and safety standards.
Qualifications
Minimum qualifications:
- Bachelor’s degree or equivalent practical experience.
- 8 years of experience in software development.
- 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
- 5 years of experience with one or more of the following: Speech/audio (e.g., technology duplicating and responding to the human voice), reinforcement learning (e.g., sequential decision making), ML infrastructure, or specialization in another ML field.
- 5 years of experience with ML design and ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine tuning).
- Experience integrating generative AI tools or LLM interfaces into workflows.
Preferred qualifications:
- Master’s degree or PhD in Engineering, Computer Science, or a related technical field.
- 8 years of experience with data structures and algorithms.
- 3 years of experience in a technical leadership role leading project teams and setting technical direction.
- Experience with any of the following: SQL Pipelines, Plx Scripts, Generative AI Agents.
- Track record of leading complex infrastructure projects.
- Ability to influence technical direction across partner teams (repair infrastructure, network, and machines, which all coexist) and improve engineering practices.