Software Engineer III, AI/ML, Google Cloud

Google · Big Tech · Hyderabad, Telangana, India

Software Engineer III at Google Cloud focused on enabling and optimizing foundational AI/ML models (LLMs, Diffusion) for customer use on Google Cloud infrastructure. This involves working with frameworks like vLLM and MaxText, partnering with customers and internal teams to resolve technical bottlenecks, conducting performance profiling and debugging of training and inference workloads, and designing ML solutions leveraging advanced ML infrastructure. The role requires experience with ML infrastructure, GenAI concepts, and software development.

What you'd actually do

  1. Enable and optimize foundational models (e.g., LLMs and Diffusion) within key frameworks like vLLM, MaxText, and MaxDiffusion, providing Google Cloud customers with immediate access to AI capabilities.
  2. Partner with customers to measure Artificial Intelligence/Machine Learning (AI/ML) model performance on Google Cloud infrastructure. Work with Customer Engineer teams to identify and resolve technical bottlenecks and drive customer success.
  3. Collaborate with internal infrastructure teams to enhance support for demanding AI workloads. Contribute to product improvement by identifying bugs and recommending enhancements.
  4. Conduct performance profiling, debugging, and troubleshooting of training and inference workloads. Maintain and update documentation and educational content based on product changes and user feedback. Triage, debug, and resolve system issues by analyzing root causes and operational impact.
  5. Design and implement specialized Machine Learning solutions leveraging advanced ML infrastructure.

Skills

Required

  • software development in Python
  • ML infrastructure (model deployment, model evaluation, data processing, debugging)
  • GenAI concepts (LLM, Multi-Modal, Large Vision Models)
  • text, image, video, or audio generation

Nice to have

  • Master’s degree or PhD in Computer Science or a related technical field
  • Generative AI, Large Language Models (LLM), or Machine Learning infrastructure, including model deployment, performance optimization, profiling, and debugging large-scale workloads
  • distributed computing leveraging GPUs or TPUs
  • collaborate effectively with cross-functional teams
  • thrive in a changing environment where AI technologies are continuously advancing

What the JD emphasized

  • foundational models
  • LLMs
  • Diffusion
  • vLLM
  • MaxText
  • MaxDiffusion
  • AI capabilities
  • AI/ML model performance
  • technical bottlenecks
  • AI workloads
  • training and inference workloads
  • ML infrastructure
  • GenAI concepts
  • Large Language Model
  • Multi-Modal
  • Large Vision Models
  • text, image, video, or audio generation
  • Generative AI
  • Large Language Models (LLM)
  • Machine Learning infrastructure
  • model deployment
  • performance optimization
  • profiling
  • debugging
  • large-scale workloads
  • distributed computing
  • GPUs
  • TPUs

Other signals

  • enabling foundational models for customers
  • optimizing models within frameworks
  • performance profiling and debugging
  • designing ML solutions on ML infrastructure