Site Reliability Engineer

Clay · Vertical AI · New York, NY · Engineering

Clay, which helps organizations turn growth ideas into reality using data, signals, and AI research, is hiring a Site Reliability Engineer. The role focuses on building and fine-tuning the infrastructure that keeps services running smoothly, with an emphasis on automation and continuous improvement. Responsibilities include architecting scalable infrastructure, managing cloud resources for availability and cost-efficiency, implementing monitoring and alerting, and leading incident response. Coding skills are essential. The role also involves participating in an on-call rotation and collaborating with various teams to balance developer velocity, reliability, and cost.

What you'd actually do

  1. Architect, design, implement, and manage robust, scalable, and secure infrastructure solutions.
  2. Develop, maintain, and enforce best practices for CI/CD, infrastructure as code, and automation.
  3. Oversee the management and optimization of cloud infrastructure, ensuring high availability, performance, and cost-efficiency.
  4. Implement monitoring, logging, and alerting solutions to maintain system health and quickly resolve issues.
  5. Lead incident response efforts, troubleshooting and resolving complex issues in a timely manner.

Skills

Required

  • 5+ years of experience
  • Experience with containerization and orchestration tools
  • Strong understanding of CI/CD concepts and tools
  • Knowledge of infrastructure automation tools
  • Experience with on-call and incident response
  • Proficiency in one or more programming languages
  • TypeScript
  • Python

Nice to have

  • Aurora PostgreSQL (RDS)
  • ElastiCache (Redis)
  • Docker + ECS
  • Lambda
  • OpenSearch
  • Terraform and Atlantis
  • CircleCI
  • Netlify
  • Playwright
  • CloudWatch
  • Datadog
  • Mezmo