Staff Software Engineer, Scaling AI Systems

Abridge · Vertical AI · San Francisco, CA · Builder

Abridge is seeking a Staff Software Engineer to join their scaling AI systems team. This role focuses on improving the performance, stability, and scalability of their software, with an 80% software and 20% cloud infrastructure focus. Responsibilities include load testing, chaos engineering, identifying and resolving performance bottlenecks, rehoming applications onto scalable platforms, building developer tools, and improving observability and incident response. The ideal candidate has 10+ years of experience in distributed systems or tooling, backend engineering, and scaling compute services on Kubernetes, with experience in GCP.

What you'd actually do

  1. Leverage load testing, chaos engineering, and other test practices to identify performance and latency bottlenecks across all of our systems, and make changes to application code to resolve them.
  2. Drive software changes that rehome applications at the code level onto new infrastructure (runtimes, event-driven infrastructure, databases, and more) in order to dramatically improve scalability and enable multi-tenant deployments.
  3. Identify and implement software configuration changes and performance tuning parameters that will dramatically improve performance and scalability.
  4. Build developer tools and software modules that help engineers across the entire engineering organization build code faster and more effectively.
  5. Work with the Platform team to develop, and application teams to adopt, emerging elements of our internal developer platform, such as service templates and self-serve infrastructure.

Skills

Required

  • 10+ years of software engineering experience focused on distributed systems or tooling
  • Experience as a back-end engineer focused on system performance and scalability
  • Experience reducing software latency by multiples using observability and profiling tools
  • Experience building on Kubernetes and scaling compute services on Kubernetes
  • Experience implementing and securing services in Google Cloud Platform with Infrastructure as Code
  • Experience building software with backend languages (e.g. Python, Go, Node.js, and Rust)
  • Experience monitoring distributed systems with Prometheus, OpenTelemetry Collector, and Grafana

Nice to have

  • Experience with related cloud-native technologies including ArgoCD, Argo Rollouts, and Istio
  • GCP experience

What the JD emphasized

  • performance
  • scalability
  • distributed systems
  • hyperscale

Other signals

  • cloud infrastructure