Production Support Engineering LMTS

Salesforce · Enterprise · San Francisco, CA +2

Salesforce is seeking a Production Support Engineer (LMTS) to join their embedded reliability team for the Agentforce for Supply Chain platform. This role focuses on production excellence, performance tuning, and infrastructure automation to scale the platform for global demand and enterprise-grade resilience. The engineer will partner with PMTS-level engineers, contribute to infrastructure strategy, maintain automated environments, support AI/ML infrastructure, harden the observability stack, optimize performance, and leverage AI tools for operational tasks. The role requires strong experience in SRE/Production Engineering, Kubernetes, Terraform, cloud platforms, and coding in Golang, TypeScript, or Python, with a deep understanding of distributed systems and AI agents. Advanced prompt engineering skills and an AI-first approach to engineering are essential.

What you'd actually do

  1. Own the reliability roadmap for major product areas, working to transition our systems from startup-speed architectures to highly-available, global-scale enterprise solutions.
  2. Partner with PMTS-level engineers to refine our infrastructure strategy, contributing senior-level perspectives on system design, capacity planning, and bottleneck identification.
  3. Maintain and evolve our automated environments, focusing on making our "infrastructure-as-plugins" model more robust and developer-friendly.
  4. Support the scaling of our AI/ML infrastructure, ensuring our models have the GPU resources and data pipelines required to deliver real-time supply chain insights.
  5. Lead the "1 to 100" hardening of our observability stack. You won’t just respond to incidents; you’ll build the tooling that prevents them and the telemetry that explains them.

Skills

Required

  • 5+ years of experience in SRE, Production Engineering, or Backend Engineering with a heavy focus on operations and scale
  • Proven Scaling Experience: You have previously helped take a product through a high-growth phase (the "1 to 100" journey), dealing with the technical debt and architectural shifts that come with it.
  • Technical Breadth: Strong proficiency in Kubernetes, Terraform/OpenTofu, and AWS/GCP/Azure.
  • Coding Mastery: Ability to write and review production-level code in Golang, TypeScript, or Python—you view automation as a software engineering problem.
  • Systems Expert: Deep understanding of distributed systems, including how to debug complex interactions between microservices, databases, and AI agents.
  • Low-Ego Collaboration: Experience working within a senior team of Principal engineers, capable of both leading specific initiatives and supporting the broader group’s technical vision.
  • A demonstrated, genuine AI-first approach to engineering: using AI to move faster, build fluency across the stack, and contribute well beyond your core specialty.
  • Experience using AI tools (e.g., Claude Code, GitHub Copilot, Codex, Cursor, etc.) in development workflows
  • Advanced prompt engineering skills and the ability to write precise, structured prompts and cultivate the system context that makes AI outputs reliable, secure, and production-ready.

Nice to have

  • M.S. in Computer Science or equivalent practical experience.
  • Database Specialist: Strong experience with PostgreSQL at scale (partitioning, indexing, query tuning).
  • Distributed Systems: Advanced knowledge of microservice orchestration and durability patterns, including hands-on experience with Temporal for workflow reliability and service mesh for secure, observable service-to-service communication in high-growth SaaS environments.
  • Supply Chain/Logistics: Experience with the unique data constraints and reliability requirements of manufacturing or global logistics.
  • Salesforce Knowledge: Familiarity with

What the JD emphasized

  • scaling
  • production excellence
  • performance tuning
  • infrastructure automation
  • agentic capabilities
  • Scaling & Reliability
  • highly-available, global-scale enterprise solutions
  • infrastructure strategy
  • system design
  • capacity planning
  • bottleneck identification
  • Infrastructure as Code
  • automated environments
  • infrastructure-as-plugins
  • AI Operations (AIOps)
  • scaling of our AI/ML infrastructure
  • GPU resources
  • data pipelines
  • real-time supply chain insights
  • Production Excellence
  • hardening of our observability stack
  • build the tooling that prevents them
  • telemetry that explains them
  • Performance Engineering
  • SQL optimization
  • API latency
  • cross-service communication
  • data-intensive supply chain platform
  • performant under heavy load
  • AI-First Workflow
  • using AI tools
  • automate routine operational tasks
  • accelerate infrastructure delivery
  • building and maintaining the shared system context
  • system designs
  • constraints
  • standards
  • AI to operate accurately and reliably
  • Critically evaluate code (Human or AI-generated)
  • correctness
  • quality
  • security
  • performance
  • 5+ years of experience in SRE, Production Engineering, or Backend Engineering with a heavy focus on operations and scale
  • Proven Scaling Experience
  • high-growth phase
  • technical debt
  • architectural shifts
  • Technical Breadth
  • Kubernetes
  • Terraform/OpenTofu
  • AWS/GCP/Azure
  • Coding Mastery
  • production-level code
  • Golang
  • TypeScript
  • Python
  • automation as a software engineering problem
  • Systems Expert
  • Deep understanding of distributed systems
  • debug complex interactions
  • microservices
  • databases
  • AI agents
  • Low-Ego Collaboration
  • Principal engineers
  • leading specific initiatives
  • supporting the broader group’s technical vision
  • demonstrated, genuine AI-first approach to engineering
  • Using AI to move faster
  • build fluency across the stack
  • contribute well beyond your core specialty
  • Experience using AI tools
  • Claude Code
  • GitHub Copilot
  • Codex
  • Cursor
  • development workflows
  • Advanced prompt engineering skills
  • write precise, structured prompts
  • cultivate the system context
  • AI outputs reliable, secure, and production-ready

Other signals

  • Scaling architecture to handle global demand
  • Hardening systems for enterprise-grade resilience
  • Integrating deeply with the Agentforce ecosystem
  • Production excellence, performance tuning, and infrastructure automation
  • Scaling our AI/ML infrastructure
  • GPU resources and data pipelines required to deliver real-time supply chain insights
  • Hardening of our observability stack
  • Building the tooling that prevents incidents and the telemetry that explains them
  • Deep-dive into SQL optimization, API latency, and cross-service communication
  • Data-intensive supply chain platform remains performant under heavy load
  • Using AI tools to automate routine operational tasks and accelerate infrastructure delivery
  • Building and maintaining the shared system context
  • Critically evaluate code (Human or AI-generated) for correctness, quality, security, and performance