Data Center Production Operations Engineer

Meta · Big Tech · Bowling Green, OH

Meta is looking for an experienced engineer to join its Data Center Production Operations team. The role involves supporting platform health, performing root cause analysis of technical issues, collaborating with cross-functional teams, and managing the introduction of new hardware. Responsibilities include troubleshooting hardware and software issues at scale, using scripting and automation, and driving improvements in processes and tools. The ideal candidate will have a strong background in server hardware, Linux, and data center operations, with experience in areas such as hardware repair, OS management, tooling, networking, or technical project management.

What you'd actually do

  1. Support platform health by resolving and closing tickets while addressing the underlying root cause, through means including, but not limited to, remote troubleshooting and physical inspection of services in data halls
  2. Participate in deep dives and root cause analysis of highly technical issues within the data center, ranging from automated tooling to hardware failures and network issues
  3. Collaborate with cross-functional teams on projects and initiatives related to topics such as process, hardware and automation
  4. Serve as point of contact for the introduction of new platforms and hardware to the site, in collaboration with partners and global resources, shortening the time it takes to bring these products to sustained mass production
  5. Use tools and data analysis to identify issues, communicate appropriately with all stakeholders, and manage or escalate as needed

Skills

Required

  • Server hardware
  • Linux administration
  • Troubleshooting
  • Automation
  • Scripting (Bash, PHP, Python, SQL, Rust, Go or Perl)
  • Data analysis
  • Technical project management
  • Networking fundamentals
  • OS management

Nice to have

  • Six Sigma knowledge/certification
  • Experience with large-scale AI implementations

What the JD emphasized

  • advanced, hands-on technical skills in server hardware and Linux
  • large-scale distributed data center environment
  • working knowledge and experience in a few of the following core areas: Hardware repair, OS management, Tooling and Automation, Networking, or Technical Project Management
  • Intermediate-level understanding of Linux (or equivalent OS) in a complex IT environment with the ability to triage, debug, and troubleshoot server issues
  • Hands-on experience and knowledge of server hardware and components, including storage
  • Intermediate-level knowledge of the interdependencies of data center functions and technologies including electrical, cooling, structured cabling, security, and network
  • Experience managing technical issues and driving to the root cause
  • Experience participating in technical projects related to areas such as process improvement, technology, and/or automation
  • Knowledge of out-of-band/lights-out server communication methods, such as IPMI and serial console
  • Experience in a large-scale data center environment