Product Reliability Engineer - Defense

Palantir · Enterprise · Washington, DC · Product Support

Palantir is seeking a Product Reliability Engineer to ensure the health, performance, and stability of its services, particularly for defense customers. The role involves end-to-end ownership of service reliability: responding to outages, improving codebases, building lasting solutions, and contributing to forward-looking product work such as infrastructure migrations and stability enhancements. The engineer will also participate in on-call rotations and collaborate with other product teams.

What you'd actually do

  1. Continuously invest in documentation, metrics, monitors and other troubleshooting tools
  2. Participate in on-call rotations during business hours and occasional weekends. This is a challenging yet rewarding opportunity to help remediate the most pressing issues across the Palantir fleet.
  3. Diagnose, resolve, and prevent issues encountered in the field. Deliver end-to-end improvements to core products based on those issues.
  4. Improve observability by refactoring codepaths and introducing telemetry
  5. Identify and implement data-driven opportunities for improved service resilience
  6. Develop strategic opinions on stability investments and inform the vision for long-term product stability

Skills

Required

  • Engineering background in Computer Science, Mathematics, Software Engineering, Physics or similar field
  • Ability to work with a high degree of ownership and a strong sense of urgency in a dynamic environment
  • Experience producing code in backend languages such as Java, as part of a past role or personal projects
  • Familiarity with storage and data processing systems and cloud infrastructure
  • Strong written and verbal communication and ability to iterate quickly with teammates and incorporate feedback
  • Eligibility and willingness to obtain a US security clearance

Nice to have

  • Familiarity with health checks and monitoring tools such as Prometheus
  • Experience coding with Java, Go, and/or web technologies (e.g. HTML, CSS, JavaScript, Python/Ruby, Django/Flask/Ruby on Rails)
  • Track record of identifying bugs in codebases and contributing fixes that lead to long-term service stability
  • Demonstrated ability to make data-driven decisions and engage with stakeholders on strategy

What the JD emphasized

  • customer-facing outages
  • strong sense of urgency
  • US security clearance