Member of Technical Staff, AI Reliability & Monitoring Engineering Lead

Postman · Enterprise · San Francisco, CA · AI

This role focuses on the reliability, scalability, and performance of Postman's AI-powered API and agentic systems in production. It involves developing and managing reliability metrics (SLOs), implementing observability and monitoring, designing automated failover and recovery strategies, optimizing resource utilization (especially GPU/accelerator efficiency), and leading the development of internal tooling for AI system stability. The role requires experience in AI reliability engineering, SRE, or DevOps for distributed systems, along with a strong understanding of AI system challenges and of how to integrate AI-specific metrics into reliability frameworks.

What you'd actually do

  1. Develop and manage reliability metrics (SLOs) for AI-driven API services and agentic AI platform features
  2. Implement comprehensive observability and monitoring systems for real-time performance and fault detection
  3. Design and drive automated failover, recovery, and incident response strategies for high-availability AI infrastructure
  4. Optimize resource utilization, particularly GPU/accelerator efficiency, ensuring cost-effective AI system operation
  5. Lead efforts to build internal tooling and automation focused on AI system stability and operational excellence

Skills

Required

  • AI reliability engineering
  • SRE
  • DevOps for distributed systems
  • cloud platforms
  • monitoring tools
  • incident response automation

Nice to have

  • AI/ML infrastructure
  • GPU/xPU optimization and scaling
  • API platform operations
  • large-scale distributed services
  • observability tools tailored for AI and agentic systems
  • open-source contributions
  • reliability engineering thought leadership

What the JD emphasized

  • AI reliability engineering
  • SRE
  • DevOps for distributed systems
  • maintaining large-scale AI systems
  • integrating AI-specific metrics into reliability frameworks
  • cloud platforms
  • monitoring tools
  • incident response automation
  • AI system reliability
  • operational health
  • AI/ML infrastructure
  • GPU/xPU optimization and scaling
  • observability tools tailored for AI and agentic systems

Other signals

  • reliability metrics for AI services
  • observability and monitoring for AI systems
  • incident response for AI infrastructure
  • GPU/accelerator efficiency
  • AI system stability