Data Center Production Operations Engineer

Meta · Big Tech · Singapore

This role is for a Data Center Production Operations Engineer responsible for maintaining server hardware and Linux systems in a large-scale data center environment. Key responsibilities include ticket resolution, root cause analysis, cross-functional collaboration, hardware repair, OS management, tooling and automation, networking, and technical project management. The role requires hands-on technical skills, problem-solving abilities, and experience with scripting and data analysis to ensure platform health and optimize efficiency.

What you'd actually do

  1. Support platform health by resolving and closing tickets while addressing the underlying root cause, through means including, but not limited to, remote troubleshooting and physical inspection of services in data halls
  2. Participate in in-depth exploration and root cause analysis of highly technical issues within the data center, ranging from automated tooling to hardware failures and network issues
  3. Collaborate with cross-functional teams on projects and initiatives related to topics such as process, hardware and automation
  4. Act as the point of contact for the introduction of new platforms and hardware to the site, in collaboration with partners and global resources, shortening the time it takes to bring these products to sustained mass production
  5. Use tools and data analysis effectively to identify issues, communicate with all stakeholders appropriately, and manage or escalate as needed

Skills

Required

  • advanced, hands-on technical skills in server hardware
  • Linux
  • server administration
  • Hardware repair
  • OS management
  • Tooling and Automation
  • Networking
  • Technical Project Management
  • troubleshooting
  • root cause analysis
  • scripting
  • data analysis
  • Bash
  • PHP
  • Python
  • SQL
  • Rust
  • Go
  • Perl
  • IPMI
  • serial console
  • Six Sigma knowledge/certification

Nice to have

  • Data Center environment
  • large-scale distributed data center environment
  • electrical
  • cooling
  • structured cabling
  • security
  • network
  • HTTP
  • DNS
  • RAID
  • DHCP
  • technical guidance to external vendors
  • large-scale AI implementations
  • integrate AI tools to optimize/redesign workflows
  • responsible, ethical AI practices
  • ongoing AI skill development
  • prompt/context engineering
  • agent orchestration

What the JD emphasized

  • server hardware
  • Linux
  • large-scale distributed data center environment
  • Hardware repair
  • OS management
  • Tooling and Automation
  • Networking
  • Technical Project Management
  • root cause analysis
  • scripting
  • automation
  • data analytics
  • large-scale data center environment
  • large-scale AI implementations
  • integrate AI tools to optimize/redesign workflows
  • responsible, ethical AI practices
  • ongoing AI skill development