Software Architect, Reliability Engineering

Twilio · Enterprise · Ireland · Remote · Engineering

Twilio is seeking a Reliability Architect to drive the technical strategy, vision, and outcomes for its Reliability Engineering organization. This role involves defining and leading solutions to ensure Twilio products are reliable worldwide, setting standards, and guiding engineering teams on best practices for designing, building, and operating resilient systems. The architect will influence company-wide architectural decisions, focusing on availability, performance, resilience, and cost efficiency using Kubernetes, AWS, Terraform, and observability tools. Responsibilities include ensuring integrity across the service lifecycle, designing fault-tolerant architectures, managing incidents and capacity, and collaborating with product teams to identify and mitigate reliability risks. The role also involves mentoring engineers and staying current with SRE and cloud best practices.

What you'd actually do

  1. Partner with senior technical leaders across Twilio to set and communicate the reliability strategy, translating business goals into measurable outcomes.
  2. Influence company-wide architectural decisions while balancing long-term vision with near-term and compliance needs.
  3. Lead the design, implementation, and operation of scalable solutions and paved roads that enable reliable, high-traffic services.
  4. Influence company-wide architectural decisions to focus on availability, performance, resilience, and cost efficiency using Kubernetes, AWS, Terraform, and modern observability.
  5. Ensure integrity and quality across the service lifecycle; design fault-tolerant architectures, incident response, disaster recovery, and capacity/cost management.

Skills

Required

  • Reliability Engineering
  • Software Engineering
  • DevOps
  • infrastructure
  • backend systems
  • reliability
  • principal/architect experience
  • strategic technical decisions
  • long-term technical vision
  • SaaS organization reliability
  • cross-org technical architecture
  • cloud architecture
  • DevOps practices
  • large-scale systems design
  • microservices
  • production experience
  • operational management
  • scaling
  • partitioning strategies
  • performance tuning
  • reliability tuning
  • high-scale environments
  • Kubernetes (e.g., EKS)
  • deploying and managing stateful services
  • AWS services
  • infrastructure-as-code tools (Terraform or CloudFormation)
  • observability tools (Prometheus, Grafana, Datadog)
  • monitoring distributed systems
  • setting up alerting
  • programming language (Go, Python, Java)
  • building automation and tooling
  • incident response processes
  • SLOs/SLIs
  • runbooks
  • on-call rotations
  • cross-functional post-incident reviews
  • distributed systems principles
  • consensus
  • durability
  • throughput
  • availability tradeoffs
  • leading reliability improvements
  • data-intensive systems
  • mission-critical systems
  • collaborating with engineering teams
  • problem-solving
  • analytical skills
  • verbal communication
  • written communication
  • cross-functional environments
  • distributed environments
  • mentoring teams
  • influencing decisions
  • balancing long-term objectives with short-term needs
  • building effective working relationships

Nice to have

  • owning and operating large AWS footprints
  • Kubernetes architecture and concepts
  • data technologies (Apache Kafka, AWS MSK)
  • reliable streaming
  • building reliable products
  • high-availability systems

What the JD emphasized

  • 15+ years of experience in Reliability Engineering, Software Engineering, or DevOps roles with a focus on infrastructure, backend systems, and reliability, including as a principal/architect.
  • Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability in high-scale environments.
  • Proven track record of leading reliability improvements in data-intensive or mission-critical systems and collaborating with engineering teams.