Senior Software Engineer, Cloud AI/ML Infrastructure

Google · Big Tech · Taipei, Taiwan

Senior Software Engineer role focused on optimizing performance, developing tools, and troubleshooting AI/ML training and inference workloads within Google's AI infrastructure. The role involves working with distributed computing and GPUs/TPUs, and ensuring the reliability and efficiency of AI services.

What you'd actually do

  1. Optimize performance across the AI infrastructure technical stack through in-depth performance profiling, debugging, and troubleshooting of AI/ML training and inference workloads.
  2. Develop tools and software for our AI/ML infrastructure to deliver an end-to-end developer experience.
  3. Collaborate with cross-functional, cross-regional teams to ensure AI/ML infrastructure delivers exceptional value and drives success for customers.
  4. Identify and resolve performance bottlenecks to maintain infrastructure that operates at peak capacity.
  5. Shape the future of AI/ML infrastructure by identifying gaps in the existing products and recommending enhancements.

Skills

Required

  • 5 years of experience with software development in one or more programming languages
  • 3 years of experience testing, maintaining, or launching software products
  • 1 year of experience with software design and architecture

Nice to have

  • Experience with Generative AI, Large Language Models (LLMs), or Machine Learning infrastructure, including model deployment, performance optimization, profiling, and debugging
  • Experience with distributed computing leveraging GPUs or TPUs
  • Ability to scope and solve ambiguous problems and to grow in a dynamic, fast-paced environment where AI technologies are continuously advancing
  • Demonstrated technical leadership, aligning team objectives and timelines with those of multiple adjacent teams
  • Ability to collaborate effectively with cross-functional and cross-regional teams

What the JD emphasized

  • AI/ML training and inference workloads
  • Generative AI, Large Language Models (LLM), or Machine Learning infrastructure
  • Distributed computing leveraging GPUs or TPUs

Other signals

  • Optimize performance across the AI infrastructure technical stack
  • Develop tools and software for our AI/ML Infrastructure
  • Identify and resolve performance bottlenecks
  • Experience with Generative AI, Large Language Models (LLM), or Machine Learning infrastructure