Sr Cloud Platform Engineer

1Password · Enterprise · United States, Canada · Remote · Technology

This role builds and maintains the infrastructure platform supporting 1Password, focusing on AWS, Kubernetes, Infrastructure as Code, and observability. The engineer will design self-service tools, automate operational tasks, implement monitoring, participate in on-call rotations, and scale infrastructure to handle growing traffic.

What you'd actually do

  1. Design and implement self-service tools that let product teams deploy services without infrastructure tickets or manual provisioning.
  2. Identify repetitive manual work and build automation to eliminate it.
  3. Implement monitoring, alerting, and dashboards that help teams sleep soundly knowing their services are healthy. Build systems that detect problems before users do, and make debugging production issues straightforward whether it's 3pm or 3am.
  4. Join the on-call rotation and respond to infrastructure incidents. You'll work to reduce incident frequency through better automation and resilience patterns.
  5. Plan capacity, optimize performance, and ensure our platform handles growing traffic without degradation. You'll work on problems like reducing deployment times, improving resource utilization, and maintaining sub-100ms p99 latencies.

Skills

Required

  • 5+ years working with distributed systems and microservices in production environments
  • Strong AWS experience – You know EC2, ECS/EKS, VPC networking, IAM, and can architect multi-AZ resilient systems
  • Infrastructure as Code fluency – Daily experience with Terraform or CloudFormation. You think in code, not clickops
  • Programming skills for automation – Comfortable writing Go, Python, or similar languages to build tools and automation
  • Production Kubernetes multi-tenancy experience – You've deployed, scaled, and debugged containerized workloads in multi-tenant production clusters
  • Observability expertise – Hands-on experience with Prometheus, Grafana, Datadog, or equivalent. You know what to monitor and how to alert effectively
  • Incident response experience – You've been on-call, resolved outages, and written postmortems that led to systemic improvements
  • Security-minded approach – You default to least-privilege, encrypt at rest and in transit, and think about threat models

Nice to have

  • GitOps experience with FluxCD and Kustomize
  • Service mesh experience (Istio, Linkerd, Consul)
  • Cost optimization experience in cloud environments
  • Open source contributions to infrastructure tooling
  • Experience with compliance frameworks (SOC 2, ISO 27001) and policy as code (Kyverno)

What the JD emphasized

  • AI agents
  • AI tools