System Hardware Reliability Engineer at Google

What you'd actually do

Lead analysis of system hardware designs to enable proactive design evaluations and product de-risk at an early stage of development.

Lead system reliability efforts by working with other organizations to define reliability goals and reliability plans, securing the resources needed to execute the plan.

Implement the reliability plan and lead all efforts to assess and mitigate risk of failure early during New Product Introduction (NPI).

Drive reliability test plans and collect, analyze, and synthesize the test data to enable verification of the design reliability goals.

Lead system reliability monitoring efforts (availability, repair trends) and proactively alert product teams on unwanted system behavior, working on mitigation strategy definition and implementation.

Skills

Required

Bachelor's degree in Reliability, Electrical, Industrial, or Mechanical Engineering, or equivalent practical experience.
5 years of experience in manufacturing.
Experience with failure analysis and fault isolation techniques and how to apply them to find root causes of failure.

Nice to have

Master's degree or PhD in Reliability, Electrical, Industrial, or Mechanical Engineering, or equivalent practical experience.
5 years of experience overseeing yield improvements and cycle time reductions for high volume and high complexity parts.
Experience with system-level reliability tools such as reliability block diagrams (RBDs), mean cumulative function (MCF), homogeneous and non-homogeneous Poisson processes (HPP, NHPP), and simulation tools.
Experience with failure analysis and fault isolation techniques and how to apply them to find root causes of failure.
Understanding of physics of failure and reliability physics.

What the JD emphasized

hardware reliability

machine learning

server

networking

storage products

early system configuration analysis

simulations

reliability capability

design

power-on reset configurations

system architect

product teams

design options

materials

sub-components/modules/sub-systems

reliability plans

samples

testing needs

execute the plan

verify the product

reliability targets

contract manufacturer

ongoing reliability program

field excursion issue resolution

statistical analysis

AI and Infrastructure team

breakthrough capabilities

insights

AI and Infrastructure at unparalleled scale

efficiency

reliability

velocity

Google customers

billions of Google users

Google's groundbreaking innovations

cutting-edge AI models

unparalleled computing power

global services

essential platforms

developers

software to hardware

world-leading hyperscale computing

TPUs

Vertex AI for Google Cloud

Google Global Networking

Data Center operations

systems research

Be part of a team that pushes boundaries, developing custom silicon solutions that power the future of Google's direct-to-consumer products. You'll contribute to the innovation behind products loved by millions worldwide. Your expertise will shape the next generation of hardware experiences, delivering unparalleled performance, efficiency, and integration.

In this role, you will manage the hardware reliability of new machine learning, server, networking, and storage products, using early system configuration analysis and simulations to assess the reliability capability of the design and power-on reset configurations. You will manage early engagement with the system architect and product teams to drive the selection of design options, materials, and sub-components/modules/sub-systems. You will also define reliability plans for new products, securing resources for required samples and testing needs.

You will execute the plan, verify that the product can meet reliability targets, and define and implement an ongoing reliability program for the product at the contract manufacturer. You will also support field excursion issue resolution and statistical analysis needs for reliability.

The AI and Infrastructure team is redefining what’s possible. We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and velocity. Our customers include Googlers, Google Cloud customers, and billions of Google users worldwide.

We're the driving force behind Google's groundbreaking innovations, empowering the development of our cutting-edge AI models, delivering unparalleled computing power to global services, and providing the essential platforms that enable developers to build the future. From software to hardware, our teams are shaping the future of world-leading hyperscale computing, with key teams working on the development of our TPUs, Vertex AI for Google Cloud, Google Global Networking, Data Center operations, systems research, and much more.

The US base salary range for this full-time position is $120,000-$172,000 + bonus + equity + benefits. Our salary ranges are determined by role, level, and location. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training. Your recruiter can share more about the specific salary range for your preferred location during the hiring process. Please note that the compensation details listed in US role postings reflect the base salary only, and do not include bonus, equity, or benefits. Learn more about benefits at Google.

Responsibilities

Lead analysis of system hardware designs to enable proactive design evaluations and product de-risk at an early stage of development.
Lead system reliability efforts by working with other organizations to define reliability goals and reliability plans, securing the resources needed to execute the plan.
Implement the reliability plan and lead all efforts to assess and mitigate risk of failure early during New Product Introduction (NPI).
Drive reliability test plans and collect, analyze, and synthesize the test data to enable verification of the design reliability goals.
Lead system reliability monitoring efforts (availability, repair trends) and proactively alert product teams on unwanted system behavior, working on mitigation strategy definition and implementation.

Qualifications

Minimum qualifications:

Bachelor's degree in Reliability, Electrical, Industrial, or Mechanical Engineering, or equivalent practical experience.
5 years of experience in manufacturing.
Experience with failure analysis and fault isolation techniques and how to apply them to find root causes of failure.

Preferred qualifications:

Master's degree or PhD in Reliability, Electrical, Industrial, or Mechanical Engineering, or equivalent practical experience.
5 years of experience overseeing yield improvements and cycle time reductions for high volume and high complexity parts.
Experience with system-level reliability tools such as reliability block diagrams (RBDs), mean cumulative function (MCF), homogeneous and non-homogeneous Poisson processes (HPP, NHPP), and simulation tools.
Experience with failure analysis and fault isolation techniques and how to apply them to find root causes of failure.
Understanding of physics of failure and reliability physics.