Senior Backend Engineer, LangSmith Deployments

LangChain · Data AI · San Francisco, CA · Engineering

LangChain is seeking a Senior Backend Engineer to work on LangSmith Deployments, the runtime infrastructure for running AI agents in production. This role involves designing and scaling durable execution runtimes for long-running, fault-tolerant agents with features like checkpointing, orchestration, and horizontal scaling. The engineer will focus on backend development, distributed systems, and infrastructure; experience with Kubernetes and DevOps tooling is strongly preferred.

What you'd actually do

  1. Design distributed queue and worker systems that handle concurrent agent execution, background tasks, and multi-agent coordination across horizontally scalable infrastructure
  2. Own core data infrastructure — state persistence, atomic job claiming, connection management, and schema evolution
  3. Collaborate on architectural decisions to keep solutions scalable and robust
  4. Ship resumable streaming infrastructure so clients can disconnect and reconnect mid-execution without losing state
  5. Instrument and monitor production systems — tracing, metrics, and alerting to keep the platform healthy

Skills

Required

  • 4+ years of professional backend engineering experience
  • Strong proficiency in Go and/or Python
  • Experience with distributed systems — consensus mechanisms, queueing, state machines, and/or workflow orchestration
  • Experience with scaling and sharding databases in high-throughput environments
  • Strong communication skills and ability to work cross-functionally on a small team

Nice to have

  • Strong familiarity with Kubernetes (K8s), Terraform, and other DevOps tooling (highly preferred)
  • Familiarity with Kubernetes, infrastructure-as-code, and at least one major cloud platform

What the JD emphasized

  • LangSmith Deployments is the runtime that makes this work
  • durable checkpointing, fault-tolerant orchestration, and horizontal scaling
  • resumable streaming infrastructure
  • distributed systems
  • scaling and sharding databases
