Senior Site Reliability Engineer

Carta · Fintech · San Francisco, CA · Site Reliability

This Senior Site Reliability Engineer role at Carta focuses on building and scaling internal platform offerings (compute, storage, networking) to ensure application reliability and performance. Responsibilities include designing and implementing monitoring, alerting, and incident response systems, collaborating with software engineers, and driving system improvements. The role requires extensive experience with a cloud platform (AWS, GCP, or Azure), Infrastructure as Code (Terraform, Ansible, or CloudFormation), networking (CNI, network policy implementations), and monitoring tools (Prometheus, Grafana, ELK Stack, or Datadog). Proficiency in Python is expected, along with experience designing, deploying, and maintaining API services. The role also emphasizes using and building AI tools, including agents, to reduce toil.

What you'd actually do

  1. Build and scale our internal platform offerings (compute, storage, and networking services) to ensure the reliability and performance of our applications.
  2. Design and implement monitoring, alerting, and incident response systems.
  3. Collaborate with application software engineers (as needed) to guide their design and ensure it scales for what Carta needs in the long run.
  4. Act as an agent of change and push boundaries to incrementally improve our systems as we expand globally.

Skills

Required

  • AWS, Google Cloud Platform, or Azure
  • Kubernetes or other container orchestration (preferred)
  • Terraform, Ansible, or CloudFormation
  • Container Network Interface (CNI), Network policy implementations
  • Prometheus, Grafana, ELK Stack, or Datadog
  • Python
  • API services
  • AI tools
  • Building agents

Nice to have

  • Proxies and service mesh
  • CI/CD and its associated best practices

What the JD emphasized

  • extensive experience with cloud services such as AWS, Google Cloud Platform, or Azure
  • Experience with Kubernetes or other container orchestration is preferred!
  • Proficient in using tools such as Terraform, Ansible, or CloudFormation
  • Experience with networking concepts and tools, including Container Network Interface (CNI), Network policy implementations.
  • Experience with proxies and service mesh is a big plus.
  • Strong knowledge of monitoring tools and practices, such as Prometheus, Grafana, ELK Stack, or Datadog
  • Proficiency in Python, with the ability to write efficient, maintainable, and scalable code.
  • Experience in designing, deploying, and maintaining API services
  • You use AI tools in your own day-to-day work in addition to enabling others.
  • You're comfortable building agents to reduce toil and expect this to be a normal part of how you operate.