Site Reliability Engineer - US Government

Palantir · Enterprise · Washington, DC · Engineering

Site Reliability Engineer responsible for building, operating, and maintaining high-performance, scalable, and reliable services for Palantir's production infrastructure in both cloud and on-prem environments. This role involves automating processes, collaborating with product teams on requirements and SLOs, troubleshooting complex systems issues, and ensuring timely resolution of production incidents.

What you'd actually do

  1. Maintain availability of cloud & physical Linux servers that power the Palantir platform in air-gapped production environments.
  2. Design, deploy, and operate infrastructure to support customer & product requirements via modern orchestration & monitoring platforms.
  3. Collaborate closely with product teams on requirements & SLOs for deploying software into air-gapped environments.
  4. Identify, troubleshoot, and solve network & systems issues.
  5. Write scripts to automate away routine operational tasks.

Skills

Required

  • Linux system administration
  • cloud-based hosting platforms (AWS, Azure, or GCP)
  • hardware-based environments
  • monitoring systems (Prometheus)
  • writing health checks
  • programming language (Java, Go, Python, JavaScript, Bash, or similar)

Nice to have

  • containers (Docker/Podman)
  • orchestration (OpenShift/Kubernetes) at scale
  • DOD 8570 IAT Level II or greater (CISSP, Sec+)
  • Unix/Linux Computing Environment certification (e.g., Linux+, RHCE)

What the JD emphasized

  • Active security clearance
  • 4+ years of experience with Linux system administration (RHEL or equivalent preferred)