Software Engineer, Platform Infrastructure (foundations)

Anyscale · Data AI · San Francisco, CA +1 · Engineering

Software Engineer role focused on building and scaling the platform infrastructure for distributed AI applications using Ray. Responsibilities include control plane and data plane development, Kubernetes, container orchestration, and cloud-native infrastructure. The role involves optimizing performance, reliability, and observability of the platform.

What you'd actually do

  1. Design, build, and scale services that orchestrate Ray clusters across cloud and on-prem environments, supporting both VM-based and Kubernetes-based deployments
  2. Optimize control plane components for large-scale, distributed AI/ML workloads
  3. Build intelligent scheduling and resource management systems for heterogeneous compute clusters
  4. Develop features to enhance the reliability, performance, scalability, and observability of Anyscale-managed Ray workloads
  5. Support and optimize accelerator integration (e.g., GPUs, TPUs)

Skills

Required

  • 3+ years of experience writing high-quality production code
  • Hands-on experience building and maintaining highly available, scalable, and performant distributed systems
  • Expertise in cloud-native technologies (AWS, Azure, GCP) and Kubernetes-based deployments
  • Deep understanding of networking, security, and authentication mechanisms in cloud environments
  • Familiarity with observability stacks (Prometheus, Grafana, etc.)
  • Proficiency in Go and Python
  • Knowledge of low-level operating system foundations (Linux kernel, file systems, containers)

Nice to have

  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience

What the JD emphasized

  • scale services
  • large-scale, distributed AI/ML workloads
  • intelligent scheduling and resource management systems
  • reliability, performance, scalability, and observability
  • high-performance execution of distributed workloads
  • scale an ML application from their laptop to the cluster
  • highly available, scalable, and performant distributed systems

Other signals

  • building the scalable, secure, and robust backbone that enables this vision
  • design, build, and scale services that orchestrate Ray clusters across cloud and on-prem environments
  • optimize control plane components for large-scale, distributed AI/ML workloads
  • build intelligent scheduling and resource management systems for heterogeneous compute clusters
  • develop features to enhance the reliability, performance, scalability, and observability of Anyscale-managed Ray workloads