Staff Software Engineer - Platform & Infrastructure

Abnormal AI · Vertical AI · United States · Remote · Platform & Infrastructure

Staff Software Engineer role focused on building and operating the core platform infrastructure (compute, orchestration, data platform) that powers Abnormal's AI/ML products at scale. The role involves shaping the roadmap, designing architecture, and championing AI-native software development practices to enable a self-service infrastructure platform.

What you'd actually do

  1. Shape the core areas of Platform Infrastructure: compute (EC2/EKS, autoscaling, container runtime), orchestration (Kubernetes, workload APIs, multi-cluster, policy/quotas), and the data platform (streaming, batch, durable storage, data tooling), with demonstrated depth in _at least two_ of these.
  2. Design and drive platform architecture & roadmap to support Abnormal’s expanding AI/ML portfolio—scaling seamlessly across services, tenants, and regions.
  3. Partner deeply with product & ML workflows to make pragmatic trade-offs, accelerating our shift to a platform-first operating model and enabling self-service.
  4. Raise the bar on operational excellence (SLOs, availability, performance, incident response, change management, on-call hygiene) and help teams consistently meet it.
  5. Act as the team’s technical lead: define quarterly roadmaps, de-risk delivery, mentor engineers, and land high-leverage, cross-team initiatives.

Skills

Required

  • Building and scaling data-intensive, distributed backend systems
  • Building platforms, tools, or infrastructure
  • Delivering self-service platform offerings
  • Languages: Python, Golang
  • Infrastructure as code: Terraform/Terragrunt
  • Data systems: PostgreSQL, Kafka, Redis, OpenSearch
  • Cloud and orchestration: AWS, Kubernetes
  • Observability
  • SRE fundamentals (SLOs, error budgets, incident management, postmortems, capacity planning)

Nice to have

  • Multi-tenant or regulated (e.g., FedRAMP-like) platforms, including isolation boundaries and guardrails
  • ML infrastructure: feature stores, offline/online consistency, model serving, evaluation/feedback loops
  • Driving cross-org migrations (e.g., to Kubernetes, event-driven architectures, or a unified data platform)

What the JD emphasized

  • self-service infrastructure platform
  • platform-first operating model
  • AI/ML products at scale
  • AI/ML portfolio
  • AI-native software development
