Principal Solutions Architect, Foundation Model Providers

Amazon · Big Tech · Seattle, WA · Solutions Architect

Principal Solutions Architect for Foundation Model Providers on AWS, focused on designing and optimizing cloud architectures for training, fine-tuning, and serving large-scale generative AI models. The role calls for deep technical expertise in cloud infrastructure (networking, GPUs, storage, orchestration) and ML fundamentals: you act as a trusted advisor to model providers, influence AWS service roadmaps, and drive best practices for AI workloads at massive scale.

What you'd actually do

  1. Maintain and foster relationships with model providers, becoming their trusted technical advisor and strategic partner
  2. Develop deep knowledge of core foundational services (Compute, Network, Storage) along with ML expertise to build long-term relationships with customer engineering teams
  3. Dive deep to understand the details of each model provider’s environment, business goals, and technical requirements for building and deploying foundation models
  4. Design and implement advanced cloud architectures that enable model providers to scale their AI research and production workloads efficiently
  5. Partner closely with AWS service teams (EC2, Global Networking, EKS, Bedrock, S3) to influence roadmaps and develop custom solutions that meet model providers’ unique requirements

Skills

Required

  • 10+ years of experience in specific technology domain areas (e.g., software development, cloud computing, systems engineering, infrastructure, security, networking, data & analytics)
  • Bachelor's degree in computer science, engineering, mathematics or equivalent
  • Experience developing technology solutions and evangelising end-to-end technology roadmaps that guide IT transformations toward cloud computing
  • Experience communicating with technical and non-technical audiences, including at the C-level, through training, workshops, and publications

Nice to have

  • Knowledge of large-scale automation and workflow management, or equivalent
  • Presentation and whiteboarding skills, with a high degree of comfort speaking with internal and external executives, IT management, and developers
  • Experience with training and deploying machine learning systems to solve large-scale optimizations, or experience operating highly available, distributed systems of data extraction, ingestion, and processing of large data sets
  • Experience with CUDA kernels or ML/low-level kernels, or experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution
  • Experience with Kubernetes, Docker, or the broader container ecosystem
  • Knowledge of foundation model architectures, training approaches, and serving infrastructure

What the JD emphasized

  • exabyte-scale data
  • millions of interconnected GPUs
  • custom hardware acceleration requirements
  • GPU optimization
  • network throughput
  • distributed training
  • cost efficiency at massive scale
  • training and deploying machine learning systems
  • highly available, distributed systems
  • large data sets
  • CUDA kernels
  • ML/low-level kernels
  • Machine Learning and Large Language Model fundamentals
  • architecture, training/inference lifecycles
  • optimization of model execution
  • foundation model architectures
  • training approaches
  • serving infrastructure

Other signals

  • Design and implement advanced cloud architectures that enable model providers to scale their AI research and production workloads efficiently
  • Drive technical and architectural best practices for GPU optimization, network throughput, distributed training, and cost efficiency at massive scale