What you'd actually do

Design and implement production-grade AI platform architectures on Kubernetes and public cloud infrastructure (AWS, Azure, and GCP).

Partner directly with customer platform, infrastructure, and ML engineering teams to deploy, operate, and optimize distributed AI workloads.

Lead implementation engagements that include platform installation, networking, security, observability, scaling, upgrades, and operational readiness.

Troubleshoot complex distributed systems issues spanning infrastructure, Kubernetes, networking, storage, and AI applications.

Develop automation, tooling, reference implementations, and infrastructure-as-code that accelerate customer success and improve repeatability.

Skills

Required

5+ years of experience in cloud infrastructure, platform engineering, DevOps, Site Reliability Engineering, or software engineering.
Experience building, deploying, or operating ML/AI platforms that support model training, inference, or large-scale data processing workloads.
Strong expertise with Kubernetes and containerized production environments.
Experience operating cloud infrastructure on AWS, Azure, or GCP, including networking, security, IAM, storage, and infrastructure automation.
Experience with Infrastructure as Code and modern DevOps tooling such as Terraform, Helm, GitOps, CI/CD pipelines, or similar technologies.
Strong software engineering skills in Python, Go, Java, or a comparable language, with experience building automation or production services.
Experience working directly with enterprise customers in consulting, professional services, field engineering, solutions architecture, or another customer-facing engineering role.
Excellent communication skills and the ability to work effectively with both executive and deeply technical stakeholders.

Nice to have

Familiarity with distributed computing frameworks such as Ray, Spark, Dask, or Kubernetes-native distributed systems is a strong plus.
A passion for solving difficult customer problems and building reusable technical solutions.
Willingness to travel as needed to work alongside strategic customers.

At Anyscale, we're on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We're commercializing Ray, a popular open-source project that's creating an ecosystem of libraries for scalable machine learning. Companies such as OpenAI, Uber, Spotify, Instacart, Cruise, and many more have Ray in their tech stacks to accelerate the move of AI applications into the real world.

With Anyscale, we’re building the best place to run Ray, so that any developer or data scientist can scale an ML application from their laptop to the cluster without needing to be a distributed systems expert.

Proud to be backed by Andreessen Horowitz, NEA, and Addition with $250+ million raised to date.

About the role:

As a Forward Deployed Engineer - AI/ML Platforms at Anyscale, you’ll partner with some of the world’s most sophisticated AI organizations to design, deploy, and operate the infrastructure powering their production AI workloads.

In this role you will work directly with customer platform, infrastructure, and ML engineering teams to solve complex technical challenges. You will help customers build scalable AI platforms, modernize ML infrastructure, and operationalize distributed AI applications on Ray and the Anyscale platform.

You will combine deep cloud infrastructure expertise with strong customer engagement skills, serving as both a trusted technical advisor and a hands-on engineer. You will work closely with customer teams throughout implementation, from architecture and deployment through production operations. Your work will provide feedback that directly influences the evolution of the Anyscale platform.

In this role, you will:

Design and implement production-grade AI platform architectures on Kubernetes and public cloud infrastructure (AWS, Azure, and GCP).
Partner directly with customer platform, infrastructure, and ML engineering teams to deploy, operate, and optimize distributed AI workloads.
Lead implementation engagements that include platform installation, networking, security, observability, scaling, upgrades, and operational readiness.
Troubleshoot complex distributed systems issues spanning infrastructure, Kubernetes, networking, storage, and AI applications.
Develop automation, tooling, reference implementations, and infrastructure-as-code that accelerate customer success and improve repeatability.
Build trusted relationships with technical leaders, platform teams, and executive stakeholders, translating business objectives into robust technical solutions.
Collaborate closely with Product and Engineering to communicate customer requirements, identify product improvements, and shape future platform capabilities.
Share best practices through technical documentation, architecture guidance, workshops, and enablement.

**We'd love to hear from you if you have:**

5+ years of experience in cloud infrastructure, platform engineering, DevOps, Site Reliability Engineering, or software engineering.
Experience building, deploying, or operating ML/AI platforms that support model training, inference, or large-scale data processing workloads.
Strong expertise with Kubernetes and containerized production environments.
Experience operating cloud infrastructure on AWS, Azure, or GCP, including networking, security, IAM, storage, and infrastructure automation.
Experience with Infrastructure as Code and modern DevOps tooling such as Terraform, Helm, GitOps, CI/CD pipelines, or similar technologies.
Strong software engineering skills in Python, Go, Java, or a comparable language, with experience building automation or production services.
Experience working directly with enterprise customers in consulting, professional services, field engineering, solutions architecture, or another customer-facing engineering role.
Excellent communication skills and the ability to work effectively with both executive and deeply technical stakeholders.
Familiarity with distributed computing frameworks such as Ray, Spark, Dask, or Kubernetes-native distributed systems is a strong plus.
A passion for solving difficult customer problems and building reusable technical solutions.
Willingness to travel as needed to work alongside strategic customers.


Forward Deployed Engineer - AI/ML Platforms
