Senior Systems Engineer, Workers AI

Cloudflare · Enterprise · Austin, TX / London, United Kingdom · Emerging Technology and Incubation

A Senior Systems Engineer role designing and building the core infrastructure for AI inference across Cloudflare's global network, with a focus on distributed systems, high-performance computing, and optimizing the model scheduling and deployment platform.

What you'd actually do

  1. Develop and maintain core components of the serverless inference platform to ensure high availability and scalability for Cloudflare users.
  2. Optimize the model scheduling system to significantly increase efficiency and resource utilization across our inference infrastructure.
  3. Implement improvements to the inference request routing logic to enhance overall performance and reduce latency for end-users.
  4. Drive significant, measurable improvements in the platform's reliability and resilience by identifying and mitigating systemic risks.
  5. Expand and refine the observability stack, including metrics, logging, and tracing, and fine-tune alerts to proactively identify and resolve production issues.
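Item 3 above (routing inference requests to reduce latency) maps to a well-known pattern: track a smoothed latency estimate per backend and route to the cheapest one. A minimal sketch in Rust, standard library only; the backend names, the EWMA smoothing factor, and the struct shape are illustrative assumptions, not Cloudflare's implementation.

```rust
// Latency-aware routing sketch: pick the backend with the lowest
// exponentially weighted moving average (EWMA) of observed latency.
struct Backend {
    name: &'static str,
    ewma_latency_ms: f64, // smoothed latency estimate
}

impl Backend {
    fn new(name: &'static str) -> Self {
        Backend { name, ewma_latency_ms: 0.0 }
    }

    // Fold a new latency sample into the moving average.
    fn observe(&mut self, sample_ms: f64) {
        const ALPHA: f64 = 0.2; // illustrative smoothing factor
        if self.ewma_latency_ms == 0.0 {
            self.ewma_latency_ms = sample_ms; // seed with the first sample
        } else {
            self.ewma_latency_ms =
                ALPHA * sample_ms + (1.0 - ALPHA) * self.ewma_latency_ms;
        }
    }
}

// Route to the backend with the lowest smoothed latency.
fn pick(backends: &[Backend]) -> &Backend {
    backends
        .iter()
        .min_by(|a, b| {
            a.ewma_latency_ms
                .partial_cmp(&b.ewma_latency_ms)
                .expect("latencies are never NaN")
        })
        .expect("at least one backend")
}

fn main() {
    let mut backends = vec![Backend::new("iad"), Backend::new("lhr")];
    backends[0].observe(12.0);
    backends[0].observe(30.0); // iad slows: EWMA = 0.2*30 + 0.8*12 = 15.6
    backends[1].observe(14.0);
    println!("routing to {}", pick(&backends).name); // lhr (14.0 < 15.6)
}
```

The EWMA keeps per-backend state O(1) and reacts to slowdowns without overreacting to a single outlier sample; real routers would combine this with health checks and load signals.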

Skills

Required

  • systems engineering
  • distributed, high-performance systems
  • Rust programming
  • asynchronous programming environments
  • networking and application protocols
  • TCP
  • HTTP
  • WebSocket
  • scaling and performance optimization techniques
  • load balancing
  • caching in a distributed environment
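The last two required skills (load balancing and caching in a distributed environment) often meet in one technique: rendezvous (highest-random-weight) hashing, which assigns each key to a node such that adding or removing a node only remaps the keys that node owned. A standard-library-only Rust sketch; the node names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Score a (node, key) pair; the key is owned by the highest-scoring node.
fn score(node: &str, key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    node.hash(&mut h);
    key.hash(&mut h);
    h.finish()
}

// Rendezvous hashing: removing a node only remaps the keys it owned,
// which is why the technique suits distributed caches and load balancers.
fn owner<'a>(nodes: &[&'a str], key: &str) -> &'a str {
    nodes
        .iter()
        .copied()
        .max_by_key(|&node| score(node, key))
        .expect("at least one node")
}

fn main() {
    let nodes = ["cache-a", "cache-b", "cache-c"];
    println!("model-weights-v1 -> {}", owner(&nodes, "model-weights-v1"));
    // With one node removed, keys owned by the remaining nodes stay put.
    let fewer = ["cache-a", "cache-b"];
    println!("model-weights-v1 -> {}", owner(&fewer, "model-weights-v1"));
}
```

Unlike modulo hashing (`hash(key) % n`), which reshuffles almost every key when `n` changes, rendezvous hashing keeps cache hit rates stable as nodes join and leave.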

Nice to have

  • container orchestration platforms
  • Kubernetes
  • Nomad
  • large-scale inference serving
  • LLM
  • diffusion models

What the JD emphasized

  • core infrastructure that powers AI inference
  • distributed systems and high-performance computing
  • sub-second model cold starts
  • multi-accelerator workload scheduling
  • efficient KV cache management
  • model deployment platform
  • AI inference platform embedded in the fabric of the internet
  • foundational infrastructure problems
  • define how AI runs at the edge of the network
  • high availability and scalability
  • significantly increase efficiency and resource utilization
  • enhance overall performance and reduce latency
  • significant, measurable improvements in the platform's reliability and resilience
  • proactively identify and resolve production issues
  • complex, cross-functional technical projects
  • Expert proficiency in Rust programming
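One emphasized point above, "efficient KV cache management", usually comes down to an eviction policy: accelerator memory is finite, so some sessions' attention key/value blocks must be dropped when it fills. A byte-bounded LRU sketch in Rust, standard library only; the capacity, session ids, and byte sizes are illustrative assumptions (a real KV-cache manager tracks per-layer tensor blocks, not strings).

```rust
use std::collections::{HashMap, VecDeque};

// Byte-bounded LRU eviction sketch for a KV-cache manager.
struct KvCache {
    capacity_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, usize>, // session id -> bytes held
    lru: VecDeque<String>,           // front = least recently used
}

impl KvCache {
    fn new(capacity_bytes: usize) -> Self {
        KvCache {
            capacity_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            lru: VecDeque::new(),
        }
    }

    // Record that `session` now holds `bytes`, evicting LRU sessions as
    // needed to stay under capacity (may evict `session` itself if it
    // alone exceeds the budget).
    fn insert(&mut self, session: &str, bytes: usize) {
        self.touch(session);
        if let Some(old) = self.entries.insert(session.to_string(), bytes) {
            self.used_bytes -= old;
        }
        self.used_bytes += bytes;
        while self.used_bytes > self.capacity_bytes {
            let victim = self.lru.pop_front().expect("over capacity implies entries");
            if let Some(freed) = self.entries.remove(&victim) {
                self.used_bytes -= freed;
            }
        }
    }

    // Move `session` to the most-recently-used position.
    fn touch(&mut self, session: &str) {
        self.lru.retain(|s| s.as_str() != session);
        self.lru.push_back(session.to_string());
    }
}

fn main() {
    let mut cache = KvCache::new(100);
    cache.insert("s1", 40);
    cache.insert("s2", 40);
    cache.touch("s1");      // s1 becomes most recently used
    cache.insert("s3", 40); // 120 > 100: evicts s2, the LRU session
    assert!(cache.entries.contains_key("s1"));
    assert!(!cache.entries.contains_key("s2"));
}
```

Production systems layer smarter policies on top (prefix sharing, paged block allocation, pinning active requests), but the accounting skeleton looks like this.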

Other signals

  • AI inference across Cloudflare's global network
  • real-time voice, frontier open LLMs, and customer-deployed models