Principal Site Reliability Engineer

Upstart · Fintech · Remote · Engineering

Upstart is seeking a Principal Site Reliability Engineer to own the reliability, resiliency, and observability of its production systems. The role involves building automation, tooling, and frameworks that keep infrastructure healthy and scalable, defining the technology operations risk strategy, and implementing disaster recovery planning. The Principal Engineer will act as a thought leader: driving adoption of SRE best practices, mentoring engineers, and influencing technical and business decisions. They will collaborate with teams including Product Engineering, DevEx, Data Engineering, and Machine Learning to improve operational excellence. Key responsibilities include leading the adoption of SRE principles, shaping reliability strategy, championing observability tooling, building self-healing systems, improving incident response for ML systems, and driving cross-functional initiatives.

What you'd actually do

  1. Lead the definition, advocacy, and adoption of SRE principles across engineering teams
  2. Partner with leadership to shape long-term reliability, resiliency, and observability strategies
  3. Champion distributed tracing, real user monitoring (RUM), and key performance metrics such as Largest Contentful Paint (LCP) to improve system visibility and user experience
  4. Build and scale self-healing systems to minimize manual intervention and reduce downtime
  5. Drive enterprise-wide improvements to incident response processes, including those related to Machine Learning systems

Skills

Required

  • Python
  • Go
  • JavaScript/TypeScript
  • Infrastructure as Code (Terraform, CDK, CloudFormation, etc.)
  • observability
  • distributed tracing
  • RUM
  • LCP
  • performance monitoring tools (e.g., Datadog, Prometheus)
  • on-call and incident management
  • automation
  • building self-healing systems
  • Using LLM/GenAI to improve SRE efficiency and processes
  • program management skills

Nice to have

  • service mesh
  • Full-stack development skills
  • building or extending observability platforms
  • Development Productivity or Quality Platforms
  • high-scale SaaS
  • microservice-oriented cloud environments

What the JD emphasized

  • Machine Learning systems
  • LLM/GenAI