Site Reliability Engineer

Dropbox · Enterprise · Mexico · CorpEng (Sub Team)

Site Reliability Engineer at Dropbox responsible for leading infrastructure strategy and technical direction, and for ensuring the reliability, scalability, and performance of infrastructure and services. The role involves building and maintaining automation and infrastructure-as-code tooling, and managing container orchestration platforms and monitoring and logging pipelines. The engineer will also drive improvement projects related to service health and visibility, and develop custom tooling.

What you'd actually do

  1. Ensure the reliability, scalability, and performance of Dropbox's infrastructure and services
  2. Collaborate with cross-functional teams to develop and maintain best practices for monitoring, logging, and incident response
  3. Build, implement, and maintain automation and infrastructure-as-code tooling, specifically Terraform, Ansible, and GitHub Actions, as well as custom code platforms
  4. Utilize container orchestration platforms, such as Kubernetes, Amazon ECS, and Red Hat OpenShift, to manage containers at scale
  5. Manage and optimize monitoring and logging pipelines using tools like Datadog and Cribl LogStream

Skills

Required

  • site reliability engineering
  • coding experience
  • AWS services
  • Linux administration
  • monitoring tools
  • logging tools
  • scripting languages (Python)
  • automation
  • infrastructure-as-code
  • configuration management
  • containerization
  • container orchestration
  • LDAP
  • REST APIs
  • Auth
  • GitHub
  • Git-based workflows
  • RDS databases
  • network security
  • problem-solving skills
  • communication skills

Nice to have

  • large-scale multi-cloud or hybrid infrastructure management
  • GitOps workflows
  • Kubernetes
  • Docker
  • serverless platforms
  • observability
  • reliability
  • incident response
  • compliance and security frameworks (SOC2, ISO 27001, FedRAMP)
  • Zero Trust security
  • access models

What the JD emphasized

  • 5+ years of experience in site reliability engineering or a similar engineering role, with hands-on coding experience
  • Strong knowledge of AWS services
  • Strong knowledge of Linux administration
  • Experience with monitoring and logging tools
  • Experience driving one or more transformational programs related to metrics and observability
  • Experience with scripting in a higher-level language (Python preferred)
  • Experience developing automation to solve infrastructure-related tasks
  • Experience with log analysis and building metrics, alerts and visuals from log data
  • Strong proficiency in infrastructure-as-code tools
  • Strong proficiency in configuration management tools
  • Experience with containerization technologies
  • Understanding of compliance and security frameworks (SOC2, ISO 27001, FedRAMP)