Meta is seeking a Production Systems Engineer, Tooling to join our Production Systems Engineering organization, where you will help drive the reliability, efficiency, and scalability of Meta's large-scale hardware infrastructure through test automation. You will design and build the systems tooling, test automation, and frameworks that keep Meta's global production fleet (spanning compute, storage, networking, and custom silicon) operating at peak performance. Working at the intersection of hardware and software, you will partner with data center operations, hardware engineering, platform teams, and ODM/vendor partners to drive systemic improvements across the full infrastructure stack.
Responsibilities
Design, build, and scale test orchestration and validation tooling, CI/CD pipelines, and automation frameworks that qualify large-scale AI hardware platforms at cluster scale, spanning provisioning, monitoring, and lifecycle management of compute, storage, and networking infrastructure
Develop tooling for hardware lifecycle management, fleet health observability, and automated remediation of production system failures across Meta's data center fleets
Identify and resolve systemic reliability and performance issues by analyzing hardware telemetry, failure patterns, and system-level diagnostics at scale
Collaborate with hardware engineering teams to define software interfaces, firmware integration requirements, and bring-up workflows for new server and accelerator platforms
Lead cross-functional efforts to evaluate, qualify, and integrate new hardware technologies into the production environment, including validation and qualification workflows
Develop scalable infrastructure automation that reduces operational toil and accelerates hardware deployment and remediation across the global fleet
Mentor other engineers on systems software design, debugging methodologies, and production infrastructure best practices
Communicate technical designs and architectural decisions through written documentation and cross-functional stakeholder alignment
Qualifications
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
3+ years of experience in production systems engineering or infrastructure software engineering, including development in C, C++, or Python for Linux-based environments
3+ years of experience with large-scale hardware infrastructure systems, including fleet automation, hardware lifecycle management, or data center operations software
3+ years of experience in designing and operating distributed systems software at scale, including monitoring, alerting, and automated remediation pipelines
3+ years of experience in communicating system designs and technical decisions through written documentation and cross-functional stakeholder engagement
Demonstrated troubleshooting skills across hardware products and automation software
Master's degree in Computer Science, Computer Engineering, or similar field
6+ years of experience across a variety of infrastructure components, such as network and compute, in a data center or large-scale production environment
3+ years of experience in building or operating CI/CD pipelines and test automation frameworks for infrastructure software
Familiarity with custom silicon or accelerator platform integration, including firmware and platform management interfaces
Expertise guiding cross-functional teams or ODM/vendor partners through the setup, integration, and execution of automation and validation frameworks at scale