Staff Software Engineer, Platform Engineering

Snap · Consumer · Los Angeles, CA

We're looking for a Staff Software Engineer to join the Platform Engineering team and build AI-powered testing tools and infrastructure: agent harnesses, evaluation systems, Temporal-based workflows, and telemetry-driven debugging capabilities that improve developer productivity and software quality across Snapchat's mobile apps and backend services.

What you'd actually do

  1. Build and own agent harnesses and testing infrastructure—tool design, prompt engineering, context management, and output evaluation—to drive functional and load testing across Snapchat's mobile apps and multi-cloud backend services
  2. Push AI-native engineering practices across the team: writing reusable skills, standing up looped and scheduled agents, building MCP tools, and offloading ops work (deploys, log triage, JIRA updates) to agents instead of doing it by hand
  3. Architect Temporal-based workflows and services that speed up detection of bugs and regressions in the CI/CD pipeline, applying async Python, workflow determinism constraints, and typed dataclass-driven design
  4. Build the telemetry and evaluation layer that makes agent behavior measurable—writing non-trivial BigQuery SQL, reasoning about mobile telemetry (Blizzard-style events, client vs. server timestamps, sampling), and turning raw logs into actionable hypotheses
  5. Work across teams to understand product requirements, evaluate trade-offs, and deliver the solutions needed to ship innovative products

Skills

Required

  • Experience designing, building, and operating backend services or distributed systems at significant scale.
  • Proven track record of owning highly available, mission-critical systems, including on-call participation, incident response, and driving systemic fixes.
  • Ability to set technical vision and lead complex, cross-functional initiatives over multiple quarters, balancing architectural quality, reliability, and product velocity.
  • Strong foundation in system design (APIs, data models, storage, pub/sub, queues, and workflow orchestration) and performance/latency optimization.
  • Deep experience with observability (metrics, logging, tracing, dashboards) and using data to debug, harden, and evolve large-scale systems.
  • Excellent collaboration and communication skills; able to work effectively with Product, DS, ML, Design, and other engineering teams to align on requirements and trade-offs.
  • Ability to mentor and uplevel engineers, provide clear technical guidance, and create structures that make the team more effective over time.
  • Bachelor’s degree in a technical field such as Computer Science, or equivalent practical experience.
  • 9+ years of software development experience; or a Master’s degree with 8+ years of experience; or a PhD with 5+ years of experience.
  • Experience acting as a technical lead, domain expert, or owner of complex technical initiatives.
  • Experience building backend systems or distributed systems in production environments.

Nice to have

  • Experience with Java, Go, Python, C++, or similar backend languages
  • Experience with large-scale microservices, cloud infrastructure, storage systems, or platform architecture
  • Experience with Kubernetes, containerized systems, data infrastructure, or service platforms
  • Experience with developer tooling, CI/CD, internal platforms, or engineering productivity systems
  • Experience building AI developer tools, coding assistants, eval systems, or workflow automation for engineers
  • Experience driving multi-year technical direction for a platform or infrastructure area
  • Track record of delivering large-scale, high-impact technical work across team boundaries

What the JD emphasized

  • AI-powered testing tools
  • agent harnesses
  • evaluation systems
  • Temporal-based workflows
  • telemetry-driven debugging
  • measurable agent behavior
  • AI-native engineering practices