What you'd actually do

Be the go-to person for mission-critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing, and more.

Migrate SaaS to self-hosted solutions to enhance security and reliability.

Implement monitoring and alerting systems, and define incident response plans and runbooks.

Reduce manual workload by automating deployment and scaling.

Build strong relationships with stakeholders to identify infrastructure needs and define Service Level Objectives.

Skills

Required

Linux/Unix systems administration
programming/scripting
cloud platforms (Azure, AWS, GCP)
on-prem hardware architectures
designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
infrastructure as code (Terraform, CloudFormation, Ansible…)
monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
Service Level Objectives (SLOs), developing runbooks/incident response plans, facilitating post-mortems, and managing systems assets
working in cross-functional teams
Excellent verbal and written communication skills

Figure is an AI robotics company developing autonomous general-purpose humanoid robots. The goal of the company is to ship humanoid robots with human-level intelligence. Its robots are engineered to perform a variety of tasks in the home and commercial markets. Figure is headquartered in San Jose, CA.

We are looking for a Site Reliability Engineer to own our internal systems infrastructure. This role is responsible for setting up and managing cloud and on-prem infrastructure to deliver highly available, reliable, and automated systems.

Responsibilities:

Be the go-to person for mission-critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing, and more.
Migrate SaaS to self-hosted solutions to enhance security and reliability.
Implement monitoring and alerting systems, and define incident response plans and runbooks.
Reduce manual workload by automating deployment and scaling.
Build strong relationships with stakeholders to identify infrastructure needs and define Service Level Objectives.
Use a data-driven approach to demonstrate service robustness and track optimization work.
Partner with the security team to ensure that security remediations and updates are applied in a timely manner.

Requirements:

Strong experience with Linux/Unix systems administration
Proficiency in programming/scripting
Extensive experience with cloud platforms (Azure, AWS, GCP) and on-prem hardware architectures
Experience designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
Mastery of infrastructure as code (Terraform, CloudFormation, Ansible…)
Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
Experience defining Service Level Objectives (SLOs), developing runbooks/incident response plans, facilitating post-mortems, and managing systems assets
Ability to work in cross-functional teams with development, infrastructure, and product teams
Excellent verbal and written communication skills

The US base salary range for this full-time position is $175,000 to $250,000 annually.

The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.

Staff Site Reliability Engineer
