Product Reliability Engineer - Defense

Palantir · Enterprise · New York, NY · Product Support

Product Reliability Engineers (PREs) are responsible for the health, performance, and stability of services at Palantir. They own service reliability end to end, from outage response to code improvements and lasting fixes. The role involves tackling critical issues, introducing observability, addressing tech debt, and informing strategic investments in core products, with an emphasis on forward-looking product work such as infrastructure migrations and stability enhancements. PREs also participate in on-call rotations to respond to alerts and customer-reported issues.

What you'd actually do

  1. Continuously invest in documentation, metrics, monitors and other troubleshooting tools
  2. Participate in on-call rotations during business hours and occasional weekends. This is a challenging yet rewarding opportunity to help remediate the most pressing issues across the Palantir fleet.
  3. Diagnose, resolve, and prevent issues encountered in the field. Deliver end-to-end improvements to core products based on those issues.
  4. Improve observability by refactoring codepaths and introducing telemetry
  5. Use data to identify opportunities for improved service resilience, and implement the resulting improvements
  6. Develop strategic opinions on stability investments and inform the vision for long-term product stability

Skills

Required

  • Engineering background in Computer Science, Mathematics, Software Engineering, Physics, or a similar field
  • Experience producing code in backend languages such as Java, as part of a past role or personal projects
  • Familiarity with storage and data processing systems and cloud infrastructure
  • Strong written and verbal communication and ability to iterate quickly with teammates and incorporate feedback
  • Eligibility and willingness to obtain a US security clearance

Nice to have

  • Comfort with and curiosity about large-scale production systems and technologies, for example load balancing, monitoring, distributed systems, and configuration management
  • Confidence in troubleshooting complex issues independently using observability tools and stack traces
  • Familiarity with monitoring tools such as Prometheus and health checks
  • Experience coding with Java, Go, and/or web technologies (e.g. HTML, CSS, JavaScript, Python/Ruby, Django/Flask/Ruby on Rails)
  • Track record of identifying bugs in codebases and contributing fixes that lead to long-term service stability
  • Demonstrated ability to make data-driven decisions and engage with stakeholders on strategy

What the JD emphasized

  • high degree of ownership
  • strong sense of urgency
  • US security clearance