Production Reliability Engineer

Unity · Enterprise · Montreal, QC · Engineering

This role is for a Production Reliability Engineer at Unity who builds and operates a platform used across the company to deploy and run services. The engineer will tackle complex multi-tenant infrastructure challenges, help standardize cloud infrastructure, and collaborate with development and site reliability teams to improve services and deployment practices. Key responsibilities include managing Kubernetes clusters, enforcing policies, allocating costs, supporting SOX compliance, and enhancing platform features such as secret management and security. The role requires strong experience with Kubernetes, cloud-native architecture, infrastructure-as-code, and GCP, along with a preference for automated, observable, and well-documented systems.

What you'd actually do

  1. Tackle complex multi-tenant infrastructure challenges: tenant isolation, policy enforcement, cost allocation, SOX compliance, and scaling shared Kubernetes clusters to support growing adoption.
  2. Help set the direction for standardizing cloud infrastructure at Unity.
  3. Collaborate with world-class development and site reliability teams to improve service applications, infrastructure, and deployment practices.
  4. Grow your own skills while watching the solutions you work on evolve and make a real impact.

Skills

Required

  • Kubernetes
  • cloud-native architecture
  • infrastructure-as-code
  • production infrastructure operation
  • multi-tenant platforms
  • GCP
  • cloud provider tradeoffs
  • platform feature delivery
  • secret management
  • policy enforcement
  • deployment pipelines
  • cost allocation
  • security hardening
  • technical influence
  • clear communication
  • technical proposals
  • technical discussions
  • resilience improvement
  • best practices sharing
  • reliability principles
  • automation
  • repeatable systems
  • observable systems
  • documented systems

Nice to have

  • Golang
  • Python
  • Node.js
  • Helm
  • Kustomize
  • ArgoCD
  • Terraform
  • Vault
  • IaC best practices
  • GKE
  • Cloud SQL
  • IAM
  • networking
  • BigQuery
  • Docker
  • containerization best practices
  • GitHub Actions
  • CI/CD pipeline design
  • cloud networking
  • DNS
  • TLS certificate management
  • multi-cloud infrastructure patterns
  • FinOps
  • cloud cost optimization

What the JD emphasized

  • SOX compliance