Software Engineering Manager, Fault Tolerance Testing

Google · Big Tech · Kirkland, WA +1

Google is seeking a Software Engineering Manager to lead a team focused on transitioning Fault Tolerance Testing (FTT) from manual tools to a proactive, AI-driven, and autonomous resilience platform. This role involves technical leadership, people management, and driving the technical roadmap for a platform at the intersection of large-scale distributed systems, developer velocity, and cloud reliability.

What you'd actually do

  1. Set and communicate team priorities that support the broader organization's goals. Align strategy, processes, and decision-making across teams.
  2. Set clear expectations with individuals based on their level and role and aligned to the broader organization's goals. Meet regularly with individuals to discuss performance and development and provide feedback and coaching.
  3. Develop the mid-term technical goals and roadmap for your team or teams (often more than one). Evolve the roadmap to meet anticipated future requirements and infrastructure needs.
  4. Design, guide, and vet system designs within the scope of the broader area, and write product or system development code to solve ambiguous problems.
  5. Review code developed by other engineers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency).

Skills

Required

  • software development
  • creating product roadmaps
  • working with cross-functional teams
  • reliability engineering
  • technical leadership
  • people management
  • team leadership

Nice to have

  • designing, building, or operating highly available, fault-tolerant distributed systems
  • chaos engineering
  • fault-injection testing frameworks
  • large-scale disaster recovery simulations
  • designing developer-facing products, APIs, SDKs, or self-service automation tools, with a focus on reducing friction and improving developer velocity
  • Google’s server frameworks (Pod)
  • gRPC/Stubby-based RPC layers
  • container orchestrators
  • defining and driving organizational key metrics (SLIs/SLOs, adoption rates, platform health)

What the JD emphasized

  • AI-driven
  • autonomous resilience platform
  • fault tolerance testing
  • cloud reliability
  • highly available, fault-tolerant distributed systems
  • chaos engineering, fault-injection testing frameworks, or large-scale disaster recovery simulations
