Hardware Systems Engineer

Meta · Big Tech · Menlo Park, CA

Meta is seeking a Hardware Systems Engineer to work on the end-to-end system validation strategy for AI/HPC hardware systems in datacenter applications. This role involves leading the bring-up, validation, and deployment of cutting-edge hardware systems, troubleshooting complex failures, and driving automation efforts. The ideal candidate will have 8+ years of experience in hardware/software engineering related to AI Silicon, GPUs, TPUs, or AI servers, with expertise in areas like ASIC development, system validation, and debugging.

What you'd actually do

  1. Drive and execute end-to-end system validation strategy (hardware and software), with a focus on various AI/HPC hardware systems in datacenter applications
  2. Lead the bring-up, validation, and deployment of cutting-edge hardware systems in large-scale deployment, with active hands-on participation
  3. Explore new use cases with customer teams and identify related test methodologies/test cases accordingly
  4. Investigate and troubleshoot complex failures potentially related to hardware systems with cross-functional teams, which may involve different stacks such as silicon, firmware, and software
  5. Triage failures and continue root-causing them while driving project development work forward

Skills

Required

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 8+ years of experience in hands-on SW, FW or HW engineering to build any of the following products (AI Silicon, GPUs, TPUs, Autonomous cars, AI servers)
  • Experience in one or more domains such as: ASIC development (silicon design, bring-up, characterization, validation), board-level debug, firmware validation, system validation
  • Experience with leading Silicon or System troubleshooting and debugging
  • Experience in developing test specifications, procedures, and debug guides for test solutions
  • 8+ years of experience integrating lab tools for automated workflows and managing large-scale deployments
  • 8+ years of experience with one or more of the following modules/domains: PCIe, NVLink, Networking, Flash, Memory, CPU, GPU, TPU, DRAM (DDR4/5 or HBM), AI silicon/AI accelerators
  • 5+ years of experience with using continuous integration and version control tools for system development and testing
  • 5+ years of experience in software, firmware, and hardware engineering to develop systems/products for datacenter applications such as video processing, AI/ML, and networking
  • 5+ years of experience with definition of HW/SW interface requirements for telemetry, diagnostics, debugging
  • Experience with debugging tools for SoCs (e.g., JTAG, GDB, Trace32) and knowledge of common bus protocols such as I2C, SPI, USB, and PCIe
  • Proficiency in High-Performance Computing (HPC) or AI system architecture at rack level and at scale
  • Proficiency in Linux environment and server system management

What the JD emphasized

  • AI/HPC hardware systems
  • datacenter applications
  • large scale deployment
  • AI silicon/AI accelerators