Global Production Systems Engineer

Meta · Big Tech · Menlo Park, CA +1

Meta is looking for a Production Systems Engineer to join its Data Center Operations team. The role involves identifying and resolving systemic issues in the server fleet, driving automation solutions, and ensuring maximum server uptime and utilization. Responsibilities include root cause analysis, developing diagnostic tooling, collaborating with cross-functional teams, and mentoring team members. Requires strong experience in production systems engineering, programming in Python, PHP, C, or C++, Linux environments, and hardware lifecycle management in large-scale hardware environments.

What you'd actually do

  1. Identify and root cause systemic issues in the fleet and drive resolutions.
  2. Deliver maximum server fleet uptime and utilization rates by using data to understand hardware failure conditions and their root causes.
  3. Write and review code, develop documentation, and debug the hardest problems, live, on some of the largest and most complex systems in the world.
  4. Own and develop diagnostic tooling requirements to run the fleet.
  5. Own and drive the escalation process for Data Center Operations to identify, root cause, and solve complex tooling and hardware issues affecting the fleet.

Skills

Required

  • production systems engineering
  • infrastructure engineering
  • systems software development
  • large-scale hardware environments
  • hardware lifecycle management
  • fleet automation
  • data center operations systems
  • Python
  • PHP
  • C
  • C++
  • Linux
  • web servers
  • load balancers
  • relational databases
  • storage systems
  • messaging systems
  • configuration management
  • infrastructure-as-code
  • distributed systems monitoring
  • alerting
  • automated remediation pipelines

Nice to have

  • multi-site data center infrastructure deployments
  • hardware qualification
  • regional rollout coordination

What the JD emphasized

  • deep experience in utilizing multiple diverse software tools to identify automation solutions intended to address complex operational issues
  • deep data analysis to drive decisions on the top priorities for automating repairs on servers in a hyperscale environment
  • driving solutions through code
  • programming in scripting languages
  • experience administering Linux systems (required)
  • 6+ years of experience in production systems engineering, infrastructure engineering, or systems software development for large-scale hardware environments
  • 6+ years of experience with hardware lifecycle management, fleet automation, or data center operations systems spanning compute, storage, or networking infrastructure
  • experience developing systems software or tooling in Python, PHP, C, or C++ for Linux-based production environments at scale