Principal Software Engineer - Storage Cache

Roblox · Consumer · San Mateo, CA · Software Engineering

Roblox's Cache team is building a next-generation caching solution designed to deliver sub-millisecond average latency, horizontal scalability, and high efficiency, all at a drastically lower cost. The team's ultimate vision is a caching infrastructure capable of supporting 1 billion Daily Active Users while reducing costs by 90%. This Principal Engineer role will lead the architectural transition, drive optimizations, and design robust frameworks for this mission-critical service.

What you'd actually do

  1. Lead the architectural transition to a next-generation, multitenant caching service built on ValKey, ensuring strict data, resource, and failure isolation for all tenants.
  2. Drive systemic optimizations to mitigate head-of-line blocking, manage hot keys, and maximize CPU and memory utilization across physical machine clusters.
  3. Design and build robust frameworks to automate development, chaos testing (fault/latency injection), and monitoring for 24x7 mission-critical services, targeting 99.99%+ availability and elastic scalability.
  4. Champion engineering best practices by leading design reviews, performance benchmarking, failure drills, and blameless post-incident retrospectives.
  5. Mentor and empower engineers, fostering a culture of deep domain expertise and seamless knowledge sharing across the Storage, Platform, and Product teams.

Skills

Required

  • BS degree in Computer Science (or equivalent professional experience)
  • 8+ years of hands-on software engineering experience
  • Deep domain knowledge in building and operating large-scale distributed systems
  • Strong builder mindset with proven experience running Active/Active distributed systems on container orchestrators like Kubernetes or Nomad
  • Strong, hands-on programming experience in Go and C++
  • Proven success in resolving massive-scale bottlenecks
  • Hands-on experience with modern telemetry and observability stacks (e.g., Prometheus, Grafana, AlertManager, Kibana)

Nice to have

  • Open Source Contributions: A track record of contributing to or maintaining major open-source caching projects such as Redis, ValKey, or Memcached.
  • Advanced Cache Internals: Experience extending cache functionality (e.g., writing custom Redis modules in C/Rust, complex Lua scripting) or deep-tuning underlying memory allocators like jemalloc.
  • Caching Proxies & Topologies: Experience with caching proxies (e.g., Twemproxy, Envoy Redis filter) and designing complex, multi-tiered caching architectures.

What the JD emphasized

  • strict data, resource, and failure isolation for all tenants
  • mitigate head-of-line blocking
  • manage hot keys
  • maximize CPU and memory utilization
  • 99.99%+ availability
  • elastic scalability
  • massive-scale bottlenecks
  • overcoming the limitations of decentralized Gossip protocols
  • mitigating partial failures in distributed systems