Software Engineer 2

Abnormal AI · Vertical AI · Bangalore, India · Hybrid · Platform & Infrastructure

Software Engineer 2 role focused on building and operating the observability platform (Prometheus, Chronosphere, Grafana) and data infrastructure (Airflow, Spark) for an enterprise AI company. Responsibilities include designing, developing, and deploying platform features, ensuring reliability, performance, and cost-efficiency of shared infrastructure, and participating in incident response. Requires strong backend engineering, distributed systems, Python, and Golang experience, with a focus on monitoring, alerting, and observability principles.

What you'd actually do

  1. Own the observability stack (Prometheus, Chronosphere, Grafana, PagerDuty) that every team relies on to detect, diagnose, and resolve production issues — when you make it better, every engineer at Abnormal gets faster.
  2. Design platforms and developer tooling that remove friction — reducing deployment times, simplifying pipeline authoring, and letting product teams focus on building rather than firefighting.
  3. Drive SLAs and SLOs for critical shared infrastructure, ensuring the systems behind our products are resilient and cost-efficient.
  4. Make architectural decisions on alerting pipelines and cross-environment deployments that define what products we can build and how quickly we deliver them to customers.
  5. Own features end-to-end: scoping, implementation, testing, deployment, and post-launch monitoring across multiple environments (US, EU, GovCloud).

Skills

Required

  • Backend Engineering & Distributed Systems (4+ years)
  • Python
  • Golang
  • systems that process data at scale
  • owning a service or platform end-to-end
  • balancing feature development with operational responsibilities
  • writing technical design documents
  • breaking down ambiguous problems
  • fault tolerance patterns
  • incident response capability
  • testing discipline
  • designing systems with a forward-looking perspective
  • cross-team technical direction
  • async-first communication
  • proactive communication
  • monitoring, alerting, and observability principles

Nice to have

  • Prometheus
  • Grafana
  • Chronosphere
  • Datadog
  • New Relic
  • Honeycomb

What the JD emphasized

  • Python
  • Golang
  • fault tolerance patterns
  • monitoring, alerting, and observability principles