Principal Software Engineer, Stateless Jobs Platform (Core Services)

Roblox · Consumer · San Mateo, CA · Software Engineering

Roblox is seeking a Principal Software Engineer to build and operate a massive-scale, multi-region stateless jobs platform at the intersection of Cloud Engineering and AI Infrastructure. The role involves designing and developing custom Kubernetes Operators and Controllers in Go to automate the lifecycle of high-throughput workloads, architecting hybrid-cloud mobility, extending the Kubernetes control plane, and empowering developer velocity through platform abstractions. This is a core infrastructure role focused on scalability, reliability, and automation for global real-time experiences.

What you'd actually do

  1. Build the Orchestration Engine: Design and develop custom Kubernetes Operators and Controllers in Go to automate the entire lifecycle of high-throughput, mission-critical stateless workloads.
  2. Architect Hybrid-Cloud Mobility: Create systems that enable workloads to move seamlessly between on-premises and public cloud environments, ensuring high availability and automated failover during regional outages.
  3. Extend the Kubernetes Control Plane: Write performant reconciliation loops and Custom Resource Definitions (CRDs) to handle complex scheduling logic and resource optimization for massive CPU- and GPU-intensive fleets.
  4. Empower Developer Velocity: Build high-level platform abstractions and automation that allow service owners to deploy global-scale code without needing to manage the underlying container orchestration.

Skills

Required

  • 10+ years of experience building web services using Go or a similar language.
  • Experience building and operating Kubernetes clusters.
  • Deep understanding of Kubernetes internals (control plane, reconciliation loops, scheduling, networking).
  • Experience building large-scale distributed systems with a focus on scalability, reliability, and availability.
  • Experience building or operating control-plane or orchestration systems (e.g., schedulers, workflow engines, or compute platforms).
  • Strong knowledge of distributed systems fundamentals such as leader election, event-driven architectures, messaging/queuing, or distributed state management.
  • Experience designing systems that handle multi-region orchestration, failover, disaster recovery, or large-scale reliability challenges.
  • Experience with on-call rotations and troubleshooting live-site issues.
  • Experience leading cross-team greenfield projects.
  • Bachelor’s degree in Computer Science or a related field, or equivalent experience.
  • Experience writing Kubernetes Operators or custom controllers using Operator SDK or controller-runtime.

What the JD emphasized

  • custom Kubernetes Operators and Controllers
  • hybrid-cloud environment
  • Kubernetes Control Plane
  • massive CPU- and GPU-intensive fleets
  • global-scale code