Staff Site Reliability Engineer

Visa · Fintech · Austin, TX

Staff Site Reliability Engineer (Azure) responsible for designing, building, and evolving cloud-native, containerized infrastructure on Microsoft Azure that powers data products and services. Focuses on platform maturity, supporting cross-functional squads, leading technical initiatives, and ensuring availability, security, scalability, and reliability of the data ecosystem. Requires deep expertise in Azure cloud architecture, infrastructure implementation, systems design, networking, databases, and modern data technologies, with hands-on experience in complex technology adoption, infrastructure automation, and high-scale distributed systems.

What you'd actually do

  1. designing, building, and evolving cloud-native, containerized infrastructure on Microsoft Azure that powers our data products and services
  2. supporting cross-functional squads, leading complex technical initiatives, and ensuring the availability, security, scalability, and reliability of our data ecosystem
  3. architecting, implementing, and optimizing Azure-based platforms and services, including cloud networking, compute, storage, identity and access management, observability, and container orchestration
  4. leading the design and delivery of enterprise-grade cloud solutions using Azure-native and hybrid-cloud patterns
  5. driving best practices for reliability, security, and operational excellence across the data platform

Skills

Required

  • Azure cloud architecture
  • Azure infrastructure implementation
  • systems design
  • networking
  • databases
  • modern data technologies
  • complex technology adoption
  • infrastructure automation
  • high-scale distributed systems
  • secure, resilient, and scalable solutions
  • cloud networking
  • compute
  • storage
  • identity and access management
  • observability
  • container orchestration
  • enterprise-grade cloud solutions
  • reliability
  • security
  • operational excellence
  • Infrastructure as Code (Terraform)
  • Kubernetes
  • CI/CD systems
  • pipeline design
  • automation
  • secure deployment practices
  • distributed systems
  • reliability engineering principles
  • SLOs
  • error budgets
  • incident response
  • SQL
  • NoSQL
  • data storage patterns
  • observability stacks
  • Bash
  • Python
  • Ansible or similar configuration-management tools
  • version control
  • testing
  • code reviews
  • design patterns
  • on-call processes
  • incident management
  • post-incident reviews
  • technical documentation
  • architectural proposals
  • decision records

Nice to have

  • AWS
  • Computer Science or Engineering background

What the JD emphasized

  • Advanced experience designing and operating large-scale, cloud-native infrastructure (AWS preferred)
  • Strong hands-on proficiency with Infrastructure as Code (Terraform)
  • Deep understanding of Kubernetes and container orchestration
  • Strong competencies in systems design, networking, distributed systems, and reliability engineering principles (SLOs, error budgets, incident response)
  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar)
  • Proven ability to lead complex, cross-functional technical initiatives from design to production rollout
  • Demonstrated experience driving technology adoption and platform modernization across multiple teams
  • Experience participating in and improving on-call processes, incident management, and post-incident reviews