Sr. Site Reliability Engineer

PitchBook · Fintech · Seattle, WA · Technology Operations

PitchBook is seeking a Sr. Site Reliability Engineer to create and evolve systems that keep its product suite running reliably and consistently. The role involves defining SLOs and building systems to meet them, deploying and managing production systems, and working with developers on large-scale services. It also covers incorporating observability tools, performing incident management and root cause analysis, eliminating single points of failure, building reliability and redundancy, establishing recoverability, mitigating failures through automation, and mentoring other engineers. The role requires taking independent responsibility for building and managing subsets of the systems, applying infrastructure-as-code best practices, and collaborating with colleagues.

What you'd actually do

  1. create and evolve systems to automatically run our suite of products and services reliably and consistently
  2. help define service level objectives (SLOs) that determine success and build systems to achieve those objectives
  3. draw on your strong background in deploying, managing, and maintaining production systems, and work with developers to operate and monitor large-scale services with complex distributed systems and data integrations
  4. incorporate observability tools (monitoring, telemetry, tracing, alerting), perform incident management, conduct root cause analyses, eliminate single points of failure, build reliability and redundancy into our infrastructure, establish and test our recoverability, mitigate failures, and do all of these things through automation and tools
  5. take independent responsibility for building and managing large subsets of our systems

Skills

Required

  • Linux/UNIX-based systems
  • cloud environments (GCP & AWS)
  • experience in a Reliability Engineering, DevOps, or infrastructure role
  • infrastructure-as-code tools (e.g. Terraform, Puppet, Ansible, Chef)
  • containers and orchestration platforms, including Kubernetes and Docker
  • infrastructure systems, networking, and security
  • operational reliability, scalability, recoverability (backups, disaster recovery, failover), and capacity planning
  • operational activities including batch processing, system backups, maintenance, monitoring, first-tier on-call support, and participation in a 24/7 response team
  • distributed, scalable microservices and event-driven architectures
  • data storage, replication, caching, and search technologies, such as PostgreSQL, MySQL, MS SQL Server, Amazon RDS, GCP CloudSQL, Redis, Elasticsearch, and Lucene/Solr
  • professional certification in AWS or GCP (DevOps or SysOps Engineer preferred)
  • Microsoft Office suite including in-depth knowledge of Outlook, Word, and Excel with the ability to pick up new systems and software easily

Nice to have

  • Master's degree

What the JD emphasized

  • critical to your success