System Hardware Reliability Engineer

Google · Big Tech · Sunnyvale, CA +1

This role focuses on the hardware reliability of new machine learning, server, networking, and storage products for Google's direct-to-consumer products. The engineer will manage early system configuration analysis and simulations to assess design reliability, drive the selection of design options and components, define and execute reliability plans, and implement ongoing reliability programs with contract manufacturers. The role also involves supporting field excursion issue resolution and statistical analysis for reliability.

What you'd actually do

  1. Lead analysis of system hardware designs to enable proactive design evaluations and product de-risking at an early stage of development.
  2. Lead system reliability efforts by working with other organizations to define reliability goals and reliability plans, securing the resources needed to execute the plan.
  3. Implement the reliability plan and lead all efforts to assess and mitigate risk of failure early during New Product Introduction (NPI).
  4. Drive reliability test plans and collect, analyze, and synthesize the test data to enable verification of the design reliability goals.
  5. Lead system reliability monitoring efforts (availability, repair trends), proactively alert product teams to unwanted system behavior, and define and implement mitigation strategies.

Skills

Required

  • Bachelor's degree in Reliability, Electrical, Industrial, or Mechanical Engineering, or equivalent practical experience.
  • 5 years of experience in manufacturing.
  • Experience with failure analysis and fault isolation techniques and how to apply them to find root causes of failure.

Nice to have

  • Master's degree or PhD in Reliability, Electrical, Industrial, or Mechanical Engineering, or equivalent practical experience.
  • 5 years of experience overseeing yield improvements and cycle time reductions for high volume and high complexity parts.
  • Experience with system-level reliability tools such as reliability block diagrams (RBDs), mean cumulative function (MCF), homogeneous and non-homogeneous Poisson processes (HPP, NHPP), and simulation tools.
  • Experience with failure analysis and fault isolation techniques and how to apply them to find root causes of failure.
  • Understanding of physics of failure and reliability physics.

What the JD emphasized

  • hardware reliability
  • machine learning
  • server
  • networking
  • storage products
  • early system configuration analysis
  • simulations
  • reliability capability
  • design
  • power-on reset configurations
  • system architect
  • product teams
  • design options
  • materials
  • sub-components/modules/sub-systems
  • reliability plans
  • samples
  • testing needs
  • execute the plan
  • verify the product
  • reliability targets
  • contract manufacturer
  • ongoing reliability program
  • field excursion issue resolution
  • statistical analysis
  • AI and Infrastructure team
  • breakthrough capabilities
  • insights
  • AI and Infrastructure at unparalleled scale
  • efficiency
  • reliability
  • velocity
  • Google customers
  • billions of Google users
  • Google's groundbreaking innovations
  • cutting-edge AI models
  • unparalleled computing power
  • global services
  • essential platforms
  • developers
  • software to hardware
  • world-leading hyperscale computing
  • TPUs
  • Vertex AI for Google Cloud
  • Google Global Networking
  • Data Center operations
  • systems research