Member of Technical Staff - Platform Engineering

Modal · Data AI · New York, NY · Engineering

Modal is an AI infrastructure company that provides GPU access, instant startups, and native storage for training models, running batch jobs, and serving low-latency inference. The company is seeking a Member of Technical Staff - Platform Engineering to focus on reliability, performance, and availability as its first reliability-focused hire. The role involves identifying architectural improvements, fostering a reliability culture, designing operational processes, participating in on-call rotations, building monitoring systems, and debugging production issues. Requirements include 5+ years of experience writing production code, 2+ years of on-call experience, strong cloud skills (AWS preferred), and familiarity with auto scaling, fleet management, and capacity planning. Experience with Kubernetes, systems safety research, or control theory is a plus.

What you'd actually do

  1. Identify architectural changes to improve reliability, performance, and availability.
  2. Foster a culture of reliability across Modal’s engineering organization.
  3. Design and implement key operational processes such as deployments, upgrades, rollbacks, and postmortem review.
  4. Join a core engineering team and participate in on-call rotation, responding to production incidents.
  5. Build monitoring systems that ensure the highest quality service for our customers.

Skills

Required

  • 5+ years of experience writing high-quality production code
  • 2+ years of on-call experience for critical production services
  • Strong cloud skills: deep familiarity with at least one hyperscaler cloud (AWS preferred)
  • Familiarity with auto scaling, fleet management, and capacity planning at scale
  • Ability to work in-person in our NYC, SF, or Stockholm offices
  • Ability to participate in on-call rotation and respond to production incidents

Nice to have

  • Experience owning and scaling Kubernetes clusters to thousands of nodes
  • Experience with systems safety research (e.g. STAMP) and control theory

What the JD emphasized

  • reliability dramatically
  • first reliability-focused hire
  • define the company’s reliability systems and practices
  • critical partner for our development teams
  • on-call rotation
  • responding to production incidents
  • highest quality service for our customers
  • debug production issues across all services and levels of the stack
  • on-call experience for critical production services