Senior Software Engineer - Compute

Aurora Innovation · Robotics · PITHQ · Software Technology Foundations

This role is for a Senior Software Engineer on the Compute team at Aurora, focused on building and maintaining a custom, large-scale distributed batch compute engine (BatchAPI) and its associated Python SDK. The engine is built on Kubernetes primitives but uses a custom scheduler, and it manages millions of compute tasks for data processing, simulation, and ML training. The role involves designing low-latency APIs, resilient communication protocols, and high-level workflow abstractions that let engineers across the company manage complex computational workloads.

What you'd actually do

  1. Design, implement, and maintain core components of the high-performance, large-scale distributed batch compute engine (BatchAPI). Architect and optimize the scheduler, resource allocator, and execution engine of BatchAPI to handle bursty, heterogeneous workloads with minimal overhead.
  2. Design low-latency APIs and resilient communication protocols that bridge our Python SDK with the Golang-based core engine.
  3. Develop high-level workflow abstractions, enabling engineers across the company to programmatically define, deploy, and manage complex data processing, simulation, and ML training pipelines.
  4. Solve complex problems in distributed locking, throttling, and fair-share scheduling to ensure multi-tenant stability.
  5. Drive continuous improvements in the performance, scalability, and resilience of the entire compute infrastructure, implementing robust monitoring and alerting systems to maintain operational excellence for critical workflows.

Skills

Required

  • 5+ years of professional software engineering experience
  • Deep expertise in Golang
  • Deep expertise in Python
  • Strong understanding of distributed systems fundamentals
  • Experience with performance profiling and tuning
  • Specialized knowledge of container orchestration systems like Kubernetes
  • Proven track record of driving continuous performance, scalability, and resilience improvements in production environments managing critical data
  • Familiarity with cloud provider compute and data services (e.g., AWS EKS, S3, RDS)

Nice to have

  • Experience working with computational workloads specific to the autonomous vehicle, robotics, or large-scale machine learning domains
  • Demonstrated ability to create and refine user-facing tools
  • Web UI development experience (TypeScript, React)

What the JD emphasized

  • custom scheduler
  • distributed systems fundamentals
  • performance profiling and tuning
  • container orchestration systems like Kubernetes
  • continuous performance, scalability, and resilience improvements