Production Systems Engineer, Automation

Meta · Big Tech · Menlo Park, CA

This Production Systems Engineer role focuses on improving the reliability, efficiency, and scalability of Meta's large-scale hardware infrastructure through test automation. It involves designing and building systems tooling, test automation, and frameworks for compute, storage, and networking infrastructure, with an emphasis on AI hardware platforms.

What you'd actually do

  1. Design, build, and scale test orchestration and validation tooling, CI/CD pipelines, and automation frameworks that qualify large-scale AI hardware platforms at cluster scale — spanning provisioning, monitoring, and lifecycle management of compute, storage, and networking infrastructure
  2. Develop tooling for hardware lifecycle management, fleet health observability, and automated remediation of production system failures across Meta's data center fleets
  3. Identify and resolve systemic reliability and performance issues by analyzing hardware telemetry, failure patterns, and system-level diagnostics at scale
  4. Collaborate with hardware engineering teams to define software interfaces, firmware integration requirements, and bring-up workflows for new server and accelerator platforms
  5. Lead cross-functional efforts to evaluate, qualify, and integrate new hardware technologies into the production environment, including validation and qualification workflows

Skills

Required

  • production systems engineering
  • infrastructure software engineering
  • C, C++, or Python for Linux-based environments
  • large-scale hardware infrastructure systems
  • fleet automation
  • hardware lifecycle management
  • data center operations software
  • designing and operating distributed systems software at scale
  • monitoring, alerting, and automated remediation pipelines
  • communicating system designs and technical decisions
  • troubleshooting skills across hardware products and automation software
  • building or operating CI/CD pipelines and test automation frameworks for infrastructure software

Nice to have

  • Master's Degree in Computer Science, Computer Engineering, or similar field
  • a variety of infrastructure components, such as network and compute, in a data center or large-scale production environment
  • custom silicon or accelerator platform integration, including firmware and platform management interfaces
  • guiding cross-functional teams or ODM/vendor partners through the setup, integration, and execution of automation and validation frameworks at scale

What the JD emphasized

  • large-scale AI hardware platforms
  • qualify large-scale AI hardware platforms
  • custom silicon or accelerator platform integration