Production Systems Engineer, AI Systems

Meta · Big Tech · Menlo Park, CA

This role covers hardware systems engineering for AI/HPC servers on the Release to Production (RTP) team. Responsibilities include end-to-end system validation, bring-up, deployment, issue investigation, and automation for new hardware systems in large-scale data centers. The role requires significant experience in hardware engineering, server systems, and networking, with a focus on the reliability and performance of AI hardware.

What you'd actually do

  1. Drive and execute a comprehensive end-to-end system validation strategy (hardware and software) for various AI/HPC hardware systems in data center applications
  2. Lead the bring-up, validation, and deployment of cutting-edge hardware systems in large-scale deployments, with active hands-on participation
  3. Explore new use cases with customer teams and identify the corresponding test methodologies and test cases
  4. Work with cross-functional teams to investigate and troubleshoot complex failures that may be related to hardware systems
  5. Triage failures and continue root-causing while driving project development work forward

Skills

Required

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 8+ years of hands-on software, firmware, or hardware engineering experience building any of the following products: AI silicon, GPUs, TPUs, autonomous cars, AI servers
  • Experience in one or more domains such as ASIC development (silicon design, bring-up, characterization, validation), board-level debug, firmware validation, or system validation
  • Knowledge of the architecture and components of one of the following products: server, PC, or laptop
  • Development or debug experience in one or more of the following areas: hardware fault management, error reporting, error handling on hardware products
  • 6+ years of experience in the networking space: switches, network interface cards (NICs), DPUs, etc.
  • Knowledge of TCP/IP and experience using tools like iperf/uperf
  • Experience working with RDMA/RoCE, including scale-out networks
  • Experience working with AI server systems
  • Experience working with large scale deployments

What the JD emphasized

  • AI/HPC hardware systems
  • large-scale deployment
  • hardware fault management
  • AI server systems