Staff Software Engineer, Platform Infrastructure

Aurora Innovation · Robotics · PITHQ · Software Technology Foundations

Staff Software Engineer to lead a team stabilizing and modernizing Offline Testing Infrastructure (OTI), critical middle-layer infrastructure for PR testing, test creation, and Verification & Validation (V&V) efforts. The role involves defining and optimizing the testing ecosystem, owning OTI components, establishing architecture and best practices, and serving as a center of excellence for offline testing.

What you'd actually do

  1. Lead the OTI Team: Serve as the technical lead (TL) for the OTI team within PIE-Compute, driving the strategic vision, execution, and long-term stability of the core infrastructure.
  2. Lead the design of the next-generation offline testing architecture to meet diverse team needs, reducing redundancy and siloing across the organization.
  3. Partner with Test Creation and Test Drive teams to standardize end-to-end test execution and reporting (Creation -> Execution -> Reporting).
  4. Take ownership of the shared OTI components, including maintenance and on-call support.
  5. Define and enforce data management policies for the testing ecosystem (storage, lifecycling, write strategies, data integrity, and lineage).

Skills

Required

  • Senior or Staff-level experience (P7 equivalent) as a Software Engineer, ideally in infrastructure, developer tooling, or critical shared services.
  • Proven experience leading technical projects and mentoring/directing other engineers.
  • Familiarity with distributed compute technologies, cloud services (e.g., AWS), and large-scale workflow management systems.
  • Demonstrated ability to triage, debug, and perform on-call and incident management for complex, cross-cutting infrastructure issues.
  • Strong communication skills to manage stakeholder alignment and drive cross-team standardization efforts.

What the JD emphasized

  • stabilizing and modernizing our Offline Testing Infrastructure (OTI)
  • critical, shared middle-layer infrastructure
  • improving the velocity of our engineering teams and the reliability of our release cycles
  • transition OTI to a stable, performant, and scalable platform
  • standardize end-to-end test execution and reporting
  • ensure performance and scalability
  • maintain clear attribution of failures to enhance reliability and efficient debugging
  • common OTI tooling, including launching test evaluations, polling APIs, communicating results, and providing recommended pipeline templates
  • data management policies for the testing ecosystem
  • single versus cross-modality testing strategies
  • manage incidents related to offline tests
  • optimizing the architecture and performance of Aurora's largest compute use case (offline testing)