Principal Site Reliability Engineer

Oracle · Enterprise · Mexico

The Principal Site Reliability Engineer owns and improves end-to-end reliability metrics, designs and implements high-availability architectures, and architects automation tools for OCI services.

What you'd actually do

  1. Lead the design, automation, and support of OCI services with a focus on resiliency, security, scalability, and performance.
  2. Own and improve the end-to-end reliability metrics (SLOs, SLAs, KPIs) for your services.
  3. Design and implement high-availability architectures and standards for large-scale distributed systems.
  4. Serve as the ultimate escalation point for complex operational issues, using a deep understanding of service topologies and interdependencies.
  5. Architect and build automation and orchestration tools that reduce manual work and prevent problem recurrence.

Skills

Required

  • Linux systems administration
  • Python
  • Bash/Shell scripting
  • Distributed systems
  • Networking
  • Service architecture
  • Databases
  • CI/CD pipelines
  • Agile methodologies
  • DevOps best practices
  • Unit tests
  • Production-grade software
  • Technical problem-solving

Nice to have

  • Monitoring tools
  • Observability tools
  • Oracle Cloud Infrastructure (OCI)
  • AWS
  • Azure
  • GCP
  • Infrastructure-as-Code
  • Terraform
  • Ansible
  • Kubernetes

What the JD emphasized

  • This is a hybrid role, not a fully remote one. It requires working in the office in Guadalajara at least 3 days a week.