Senior Team Leader, Data Reliability Engineering

Redfin · Seattle · MI · Remote

Lead a Data Reliability Engineering team focused on the reliability, observability, operational maturity, and trustworthiness of enterprise data platforms, including databases, data warehouses, pipelines, and storage. This role involves defining standards, creating metrics, developing roadmaps, and supporting modernization efforts, with a focus on regulated environments.

What you'd actually do

  1. Lead an engineering team responsible for improving the reliability, observability, recoverability, and operational maturity of enterprise data platforms.
  2. Define reliability standards for databases, data warehouses, pipelines, jobs, storage, access patterns, and supporting infrastructure.
  3. Establish operating expectations for monitoring, alerting, logging, incident response, change management, backup/recovery, disaster recovery, patching, access controls, service ownership, and operational readiness.
  4. Create metrics that measure platform health, data freshness, data quality, recovery readiness, incident trends, operational risk, compliance alignment, and business impact.
  5. Lead current-state assessments of systems, data flows, operational processes, observability, access patterns, and reliability gaps.

Skills

Required

  • Leadership experience (5+ years)
  • Data infrastructure experience (10+ years)
  • Reliability engineering principles
  • Observability (monitoring, alerting, logging)
  • Incident response
  • Change management
  • Backup and recovery strategies
  • Disaster recovery planning
  • Data quality assessment
  • Security and compliance understanding
  • Ability to create structure and set priorities
  • Influence across teams

Nice to have

  • Experience in financial services, mortgage, banking, lending, insurance, or other regulated enterprise environments
  • Experience with AWS, Snowflake, Microsoft SQL Server, Postgres, Redshift, Aurora
  • Experience with on-premise to cloud migrations or data platform modernization
  • Experience defining data reliability practices (lineage, reconciliation, data incident management)
  • Experience leading senior technical talent (Staff/Principal Engineers)

What the JD emphasized

  • reliability
  • observability
  • operational maturity
  • trustworthiness
  • enterprise data platforms
  • reliability standards
  • operating expectations
  • monitoring
  • alerting
  • logging
  • incident response
  • change management
  • backup/recovery
  • disaster recovery
  • patching
  • access controls
  • service ownership
  • operational readiness
  • metrics
  • platform health
  • data freshness
  • data quality
  • recovery readiness
  • incident trends
  • operational risk
  • compliance alignment
  • business impact
  • assessments
  • systems
  • data flows
  • operational processes
  • access patterns
  • reliability gaps
  • executable roadmaps
  • platform stability
  • data trust
  • security alignment
  • operational predictability
  • migration and modernization
  • on-premise platforms
  • AWS
  • Snowflake
  • enterprise data systems
  • durable operating mechanisms
  • reliability reviews
  • service health reviews
  • incident reviews
  • operational readiness reviews
  • risk reviews
  • roadmap reviews
  • executive reporting
  • senior technical talent
  • leadership structure
  • Data Reliability Engineering
  • 10+ years of experience
  • data infrastructure
  • database engineering
  • data platform engineering
  • cloud infrastructure
  • site reliability engineering
  • 5+ years of experience leading engineering teams
  • production systems
  • databases
  • data platforms
  • infrastructure platforms
  • reliability engineering
  • enterprise data infrastructure
  • data warehouses
  • pipelines
  • storage
  • compute
  • resiliency
  • production operations
  • improving reliability practices
  • complex production environments
  • lifecycle management
  • service health metrics
  • data reliability metrics
  • operational maturity indicators
  • executive-level reporting
  • enterprise security
  • compliance
  • access management
  • auditability
  • operational controls
  • infrastructure standards
  • create structure in ambiguous environments
  • set clear priorities
  • influence across teams
  • translate technical reliability work into business outcomes
  • financial services
  • mortgage
  • banking
  • lending
  • insurance
  • regulated enterprise environments
  • mergers
  • acquisitions
  • integrations
  • large-scale enterprise transformation
  • Microsoft SQL Server
  • Postgres
  • Redshift
  • Aurora
  • data platform technologies
  • on-premise to cloud migrations
  • data platform modernization
  • large-scale infrastructure transformation
  • lineage
  • reconciliation
  • pipeline observability
  • data incident management
  • leading senior technical talent
  • Staff or Principal Engineers