Data Center Production Operations Engin… at Meta

What you'd actually do

Perform in-depth exploration and root cause analysis of complex technical issues within the data center, ranging from automated tooling to hardware failures and network issues

Facilitate collaboration with cross-functional teams on projects and initiatives related to topics such as process, hardware and automation

Lead the introduction of new platforms and hardware to the site and geographical area, in collaboration with partners and global resources, accelerating the time it takes to bring these products to sustained mass production

Use tools and data analysis effectively to identify issues that are larger in scope and which impact one or multiple Data Centers.

Skills

Required

Expert in Linux (or equivalent OS) in a complex IT environment with the ability to triage, debug, and troubleshoot complex, systemic issues
Hands-on experience and knowledge of server hardware and components, including storage
Experience managing multiple technical issues concurrently, driving each to the root cause
Experience participating in or leading technical projects related to areas such as process improvement, technology, and/or automation.
Knowledge of out-of-band/lights-out server communication methods, such as IPMI and serial console
Proven experience in fostering growth in others, and driving influence across all organizational levels
Experience with large-scale AI implementations
Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

Nice to have

BS, BA or BEng in technical field or commensurate experience
7+ years of technical IT experience within an infrastructure environment, in a role such as Systems Administrator, DevOps Engineer, or Site Reliability Engineer
Experience of the interdependencies of data center functions and technologies including electrical, cooling, structured cabling, security, and network
Ability to communicate effectively, in a clear and concise manner, appropriately tailoring messages to the audience
Extensive technical knowledge of technologies such as HTTP, DNS, RAID, and DHCP
Experience in providing technical guidance to external vendors
Experience debugging, modifying and developing code in at least one of these commonly used scripting or programming languages: Bash, PHP, Python, SQL, Rust, Go or Perl
Use data analytics to drive maximum server up-time and utilization rates, understanding hardware failure rates and service level agreements
Six Sigma knowledge/certification

What the JD emphasized

advanced, hands-on technical skills in server hardware and Linux

extensive knowledge of server administration and performing on complex projects in a large-scale distributed data center environment

Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)

Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)

Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

Meta is seeking a forward-thinking, experienced engineer to join the Production Operations team within our Data Centers. These Data Centers are the foundation upon which our rapidly scaling infrastructure efficiently operates and upon which our innovative services are delivered. Meta is at the leading edge of the global data center industry in terms of both how data centers are designed and how they are operated. This person should enjoy working in a fast-paced, technical environment where adaptability and flexibility will be key to their success. We seek an IT professional with advanced, hands-on technical skills in server hardware and Linux, ideally in a Data Center environment. Having extensive knowledge of server administration and performing on complex projects in a large-scale distributed data center environment is a core competency of this individual. The candidate should also have good knowledge and experience in a few of the following core areas: Hardware repair, OS management, Tooling and Automation, Networking, or Technical Project Management.

Responsibilities

Support platform health by successfully resolving and closing complex tickets, while addressing the overall issue (i.e. addressing root cause) including, but not limited to, remote troubleshooting and physical inspection of services in data halls
Perform in-depth exploration and root cause analysis of complex technical issues within the data center, ranging from automated tooling to hardware failures and network issues
Facilitate collaboration with cross-functional teams on projects and initiatives related to topics such as process, hardware and automation
Lead the introduction of new platforms and hardware to the site and geographical area, in collaboration with partners and global resources, accelerating the time it takes to bring these products to sustained mass production
Use tools and data analysis effectively to identify issues that are larger in scope and which impact one or multiple Data Centers. Take actions to communicate with all stakeholders appropriately and manage or escalate as needed
Drive corrective actions of complex hardware issues, work with internal teams and vendors; provide an ownership stake, and influence future design changes to ensure ease of serviceability
Solve complex and systemic hardware and/or software issues at scale using scripting, automation, and tooling to drive global resolution
Continuously evaluate and identify areas for improvement in processes, tools, and systems to optimize efficiency and quality of repairs
Use data analytics to drive maximum server up-time and utilization rates, understanding hardware failure rates and service level agreements
Coach and mentor team members to evaluate and identify better ways to resolve issues, and define updates to tools and processes
Provide engineering support and be a go-to technical resource and Subject Matter Expert for the team, leadership, and cross-functional teams in all aspects of operating and maintaining data center servers
Maintain and update documentation, i.e. procedures, runbooks and guides
Build cross functional relationships and influence policies and procedures that improve global data center operations
Participate in 24/7 on-call rotation
Ability to travel up to 15% of the time

Qualifications

BS, BA or BEng in technical field or commensurate experience
7+ years of technical IT experience within an infrastructure environment, in a role such as Systems Administrator, DevOps Engineer, or Site Reliability Engineer
Expert in Linux (or equivalent OS) in a complex IT environment with the ability to triage, debug, and troubleshoot complex, systemic issues
Hands-on experience and knowledge of server hardware and components, including storage
Experience of the interdependencies of data center functions and technologies including electrical, cooling, structured cabling, security, and network
Experience managing multiple technical issues concurrently, driving each to the root cause
Experience participating in or leading technical projects related to areas such as process improvement, technology, and/or automation. Brings peers, partners and other resources into the project where additional expertise is needed, and to provide growth and learning opportunities for others
Ability to communicate effectively, in a clear and concise manner, appropriately tailoring messages to the audience
Extensive technical knowledge of technologies such as HTTP, DNS, RAID, and DHCP
Experience in providing technical guidance to external vendors
Experience debugging, modifying and developing code in at least one of these commonly used scripting or programming languages: Bash, PHP, Python, SQL, Rust, Go or Perl
Knowledge of out-of-band/lights-out server communication methods, such as IPMI and serial console
Experience using data and metrics to drive decisions
Proven experience in fostering growth in others, and driving influence across all organizational levels
Experience in a large-scale data center environment
Experience with large-scale AI implementations
Six Sigma knowledge/certification
Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies