Staff Site Reliability Engineer

Figure AI · Robotics · HQ · Platform Software

Figure AI is seeking a Staff Site Reliability Engineer to manage its internal systems infrastructure, keeping cloud and on-prem systems highly available and reliable. The role involves automation, monitoring, incident response, and collaboration with cross-functional teams to support the development of autonomous humanoid robots.

What you'd actually do

  1. Be the go-to person for mission-critical infrastructure that enables critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing, and more.
  2. Migrate SaaS to self-hosted solutions to enhance security and reliability.
  3. Implement monitoring and alerting systems, and define incident response plans and runbooks.
  4. Reduce human workload by automating deployment and scaling.
  5. Build strong relationships with stakeholders to identify infrastructure needs and establish Service Level Objectives.

Skills

Required

  • Linux/Unix systems administration
  • programming/scripting
  • cloud platforms (Azure, AWS, GCP)
  • on-prem hardware architectures
  • designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
  • infrastructure as code (Terraform, CloudFormation, Ansible…)
  • monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
  • networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
  • Service Level Objectives (SLO), developing runbooks/incident response plans, facilitating post-mortems, and managing systems assets
  • work in cross-functional teams
  • Excellent verbal and written communication skills

What the JD emphasized

  • mission critical infrastructure
  • highly available, reliable, and automated systems
  • high-availability, fault-tolerant, and distributed systems
  • Service Level Objectives (SLO)