Staff Software Engineer, AI/ML, Google Distributed Cloud, Storage

Google · Big Tech · Kirkland, WA +1

This Staff Software Engineer role on Google Distributed Cloud (GDC) Storage focuses on building the foundational data layer for AI/ML workloads at the edge and in air-gapped environments. It involves designing and developing scalable, distributed storage solutions optimized for massive throughput and ultra-low latency to support AI training and inference, and integrating those solutions with external storage hardware.

What you'd actually do

  1. Design and develop scalable, distributed File and Object storage solutions that serve as the critical foundational backbone for complex AI/ML workloads within the GDC environment.
  2. Engineer advanced solutions optimized for massive throughput, specifically enabling high-frequency model checkpointing and the ultra-low latency data access required for AI training and inference.
  3. Drive technical execution with external partners (e.g., VAST Data) to seamlessly integrate industry-leading, high-performance storage hardware with Google’s distributed software ecosystem.
  4. Take full life-cycle responsibility for core storage services, ensuring uncompromising security, data durability, and high availability across disconnected edge and on-premises data center environments.
  5. Partner closely with AI infrastructure, compute, and networking teams to architect system-wide improvements, eliminate I/O bottlenecks, and deliver a unified "cloud-anywhere" experience.

Skills

Required

  • Software development
  • ML infrastructure
  • Systems architecture
  • Distributed systems
  • Large-scale storage architectures
  • Systems programming
  • C++
  • Go
  • Rust

Nice to have

  • Master’s degree or PhD in Engineering, Computer Science, or a related technical field
  • Data structures and algorithms
  • Technical leadership
  • Cross-functional project management

What the JD emphasized

  • 8 years of experience in software development
  • 5 years of experience with ML design and ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine-tuning)
  • 5 years of experience in systems architecture, including building or maintaining distributed systems or large-scale storage architectures
  • 5 years of experience in systems programming (e.g., C++, Go, or Rust)

Other signals

  • AI models require massive throughput and ultra-low latency
  • optimizing the data pipeline to keep GPUs and accelerators fully saturated
  • high-performance, massively scalable storage systems directly to customer data centers