Data Center Production Operations Engineer

Meta · Big Tech · Singapore

Meta is seeking an experienced engineer for its Data Center Production Operations team. The role involves supporting platform health, performing root cause analysis of complex technical issues, leading the introduction of new hardware, and driving corrective actions for hardware and software issues at scale. The candidate should have strong Linux and server hardware skills, with experience in areas such as hardware repair, OS management, tooling, automation, and networking. The role also emphasizes using data analytics to optimize server uptime and utilization, and requires participation in on-call rotations. Although the role does not involve building AI models directly, it requires integrating AI tools to optimize workflows, adhering to ethical AI practices, and demonstrating ongoing AI skill development.

What you'd actually do

  1. Support platform health by resolving and closing complex tickets while addressing the underlying root cause, through work including, but not limited to, remote troubleshooting and physical inspection of services in data halls
  2. Perform in-depth exploration and root cause analysis of complex technical issues within the data center, ranging from automated tooling to hardware failures and network issues
  3. Facilitate collaboration with cross-functional teams on projects and initiatives related to topics such as process, hardware and automation
  4. Lead the introduction of new platforms and hardware to the site and geographical area, in collaboration with partners and global resources, accelerating the time it takes to bring these products to sustained mass production
  5. Use tools and data analysis effectively to identify issues that are larger in scope and impact one or more data centers

Skills

Required

  • Expert in Linux (or equivalent OS) in a complex IT environment with the ability to triage, debug, and troubleshoot complex, systemic issues
  • Hands-on experience and knowledge of server hardware and components, including storage
  • Experience managing multiple technical issues concurrently while driving each to its root cause
  • Experience participating in or leading technical projects related to areas such as process improvement, technology, and/or automation
  • Knowledge of out-of-band/lights-out server communication methods, such as IPMI and serial console
  • Proven experience in fostering growth in others, and driving influence across all organizational levels
  • Experience with large-scale AI implementations
  • Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
  • Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
  • Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

Nice to have

  • BS, BA or BEng in technical field or commensurate experience
  • 7+ years of technical IT experience within an infrastructure environment, in a role such as Systems Administrator, DevOps Engineer, or Site Reliability Engineer
  • Understanding of the interdependencies of data center functions and technologies, including electrical, cooling, structured cabling, security, and network
  • Ability to communicate effectively, in a clear and concise manner, appropriately tailoring messages to the audience
  • Extensive technical knowledge of technologies such as HTTP, DNS, RAID, and DHCP
  • Experience in providing technical guidance to external vendors
  • Experience debugging, modifying, and developing code in at least one commonly used scripting or programming language: Bash, PHP, Python, SQL, Rust, Go, or Perl
  • Ability to use data analytics to maximize server uptime and utilization rates, with an understanding of hardware failure rates and service level agreements
  • Six Sigma knowledge/certification

What the JD emphasized

  • Advanced, hands-on technical skills in server hardware and Linux
  • Extensive knowledge of server administration and experience working on complex projects in a large-scale distributed data center environment
  • Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
  • Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
  • Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies