Software Engineer III, AI/ML, Google Cloud

Google · Big Tech · Hyderabad, Telangana, India

Software Engineer III role focused on enabling and optimizing foundational AI/ML models (LLMs, Diffusion) within Google Cloud infrastructure, specifically using frameworks like vLLM and MaxText. The role involves partnering with customers to measure model performance and identify technical bottlenecks, collaborating with infrastructure teams, and designing specialized ML solutions. It requires experience with ML infrastructure, GenAI concepts, and debugging training and inference workloads.

What you'd actually do

  1. Enable and optimize foundational models (e.g., LLMs and Diffusion) within key frameworks like vLLM, MaxText, and MaxDiffusion, providing Google Cloud customers with immediate access to AI capabilities.
  2. Partner with customers to measure Artificial Intelligence/Machine Learning (AI/ML) model performance on Google Cloud infrastructure. Identify and resolve technical bottlenecks to drive customer success, working with Customer Engineering teams.
  3. Collaborate with internal infrastructure teams to enhance support for demanding AI workloads. Contribute to product improvement by identifying bugs and recommending enhancements.
  4. Conduct performance profiling, debugging, and troubleshooting of training and inference workloads.
  5. Design and implement specialized Machine Learning solutions leveraging advanced ML infrastructure.

Skills

Required

  • Python
  • ML infrastructure (model deployment, model evaluation, data processing, debugging)
  • GenAI concepts (Large Language Models, Multi-Modal Models, Large Vision Models)
  • Text, image, video, or audio generation

Nice to have

  • Master’s degree or PhD in Computer Science or a related technical field
  • Experience with Generative AI, Large Language Models (LLMs), or Machine Learning infrastructure, including model deployment, performance optimization, profiling, and debugging large-scale workloads
  • Experience with distributed computing leveraging Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs)
  • Ability to collaborate effectively with cross-functional teams.
  • Ability to thrive in a changing environment where AI technologies are continuously advancing.

What the JD emphasized

  • foundational models
  • AI workloads
  • training and inference workloads
  • ML infrastructure

Other signals

  • Enabling foundational models for customers
  • Optimizing inference frameworks
  • Partnering with customers on AI/ML performance
  • Designing ML solutions on cloud infrastructure