Staff Infrastructure Software Engineer (Kubernetes)

Cresta · Vertical AI · Germany, Romania · Remote · Engineering

Staff Infrastructure Software Engineer responsible for designing, building, and advancing core infrastructure, including multi-cloud Kubernetes clusters, developer toolchains, and automation. The role specifically involves building machine learning infrastructure to support AI teams in training, testing, and deploying models on large-scale datasets, with a bonus for GPU-enabled cluster experience.

What you'd actually do

  1. Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.
  2. Ensure reliability of multi-cloud Kubernetes clusters and pipelines.
  3. Provide metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.
  4. Build infrastructure-as-code deployment tooling and supporting services on multiple cloud providers.
  5. Automate operations and engineering work so the team can spend its energy where it matters.
  6. Build machine learning infrastructure that enables AI teams to train, test, and deploy models on large-scale datasets.

Skills

Required

  • DevOps
  • Site Reliability Engineering
  • Production Engineering
  • Golang
  • Python
  • Container-related security best practices
  • Kubernetes
  • Helm
  • Kustomize
  • Terraform
  • CloudFormation
  • AWS
  • IAM
  • S3
  • EC2
  • EKS
  • PostgreSQL
  • GitOps
  • Flux
  • Argo
  • CI/CD
  • GitHub Actions

Nice to have

  • GPU-enabled clusters
  • Google Cloud
  • Azure

What the JD emphasized

  • 5+ years experience in DevOps, Site Reliability Engineering, Production Engineering, or equivalent field.
  • Deep proficiency with coding languages such as Golang or Python.
  • Deep familiarity with container-related security best practices.
  • Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns.
  • Production experience with Kubernetes templating tools such as Helm or Kustomize.
  • Production experience with IaC tools such as Terraform or CloudFormation.
  • Production experience working with AWS and services such as IAM, S3, EC2, and EKS.
  • Production experience with other cloud providers such as Google Cloud and Azure is a bonus.
  • Production experience with database software such as PostgreSQL.
  • Experience with GitOps tooling such as Flux or Argo.
  • Experience with CI/CD such as GitHub Actions.

Other signals

  • building machine learning infrastructure
  • enables AI teams to train, test, and deploy
  • large-scale datasets
  • multi-cloud Kubernetes clusters
  • GPU-enabled clusters is a bonus