Senior Software Engineer, Scaling AI Systems

Abridge · Vertical AI · San Francisco, CA · Builder

Abridge is seeking a Senior Software Engineer to join its scaling AI systems team. This role focuses on improving the performance, stability, and scalability of the company's AI-powered healthcare platform. Responsibilities include load testing, chaos engineering, identifying and resolving performance bottlenecks, driving software changes for scalability, building developer tools, and improving incident response through enhanced observability. The role is primarily software-focused, with a portion devoted to cloud infrastructure, and involves working with distributed systems and cloud-native technologies on GCP.

What you'd actually do

  1. Leverage load testing, chaos engineering, and other test practices to identify performance and latency bottlenecks across all of our systems, and make changes to application code to resolve them.
  2. Drive software changes that rehome applications, at the code level, onto new infrastructure (runtimes, event-driven infrastructure, databases, and more) to dramatically improve scalability and enable multi-tenant deployments.
  3. Identify and implement software configuration changes and performance tuning parameters that will dramatically improve performance and scalability.
  4. Build developer tools and software modules that help engineers across the entire engineering organization build code faster and more effectively.
  5. Work with the Platform team to develop, and application teams to adopt, emerging elements of our internal developer platform, such as service templates and self-serve infrastructure.

Skills

Required

  • 8+ years of software engineering experience focused on distributed systems or tooling, with an interest in engineering enablement and software scaling.
  • At least 2 years of experience as a back-end engineer focused on system performance and scalability.
  • Experience reducing software latency by multiples using observability and profiling tools, and deriving great pleasure from doing so.
  • Experience building on Kubernetes and scaling compute services on Kubernetes; experience with related cloud-native technologies including ArgoCD, Argo Rollouts, Istio, etc.
  • Comfortable implementing and securing services in Google Cloud Platform with Infrastructure as Code, including GCP projects, VPC networks, Google Kubernetes Engine, and IAM roles, groups, and policies.
  • Experience building software with backend languages (e.g. Python, GoLang, Node, and Rust).
  • Experience monitoring distributed systems with Prometheus, OpenTelemetry Collector, and Grafana (or something similar), including metrics collection, visualization, alerting, and using observability data to drive performance optimizations.

Nice to have

  • GCP experience
  • Kubernetes
  • ArgoCD
  • Argo Rollouts
  • Istio
  • Python
  • GoLang
  • Node
  • Rust
  • Prometheus
  • OpenTelemetry Collector
  • Grafana

What the JD emphasized

  • hyperscale
  • performance
  • scalability
  • distributed systems
  • cloud infrastructure
  • observability
  • developer tools
  • internal developer platform
  • SLOs
  • incident response
