Software Engineer III, Reliability

Box · Enterprise · Redwood City, CA · Engineering

This Software Engineer III role sits on the Reliability Engineering team at Box and focuses on the performance, scalability, and reliability of a content management platform that incorporates enterprise AI. The work involves analyzing system behavior, designing scalable solutions, building testing frameworks, and optimizing backend services and infrastructure.

What you'd actually do

  1. Partner with product and platform engineering teams to assess service designs for scalability and performance risks, ensuring systems are built for long-term growth.
  2. Analyze production workloads, system metrics, and load test results to identify bottlenecks, resource inefficiencies, and architectural scaling limits.
  3. Design and build frameworks for load testing, capacity modeling, and performance validation that enable teams to proactively address scale concerns.
  4. Drive improvements in backend service efficiency, API response times, and resource utilization across Box’s globally distributed platform.
  5. Collaborate with SRE, infrastructure, and platform teams to optimize scaling strategies, auto-scaling policies, and resource allocation.

Skills

Required

  • 3+ years of experience in software engineering, performance engineering, or site reliability engineering, with a focus on backend systems and scalability.
  • Proficiency in one or more programming languages such as Go or Java, with an emphasis on building performant services.
  • Strong understanding of distributed systems, concurrency, resource contention, and efficient system design.
  • Hands-on experience analyzing and improving application and system performance across compute, storage, database, and networking layers.
  • Familiarity with load testing and performance benchmarking tools (e.g., Locust, JMeter, Gatling, or custom frameworks).
  • Experience working with cloud infrastructure (AWS, GCP) and container orchestration (Kubernetes).
  • Proficiency with observability tools and telemetry systems (e.g., Prometheus, Chronosphere, Grafana, Datadog, ELK).
  • Excellent problem-solving and analytical skills, with a data-driven approach to diagnosing complex system behaviors.
  • Strong collaboration and communication skills; comfortable partnering across engineering teams to drive reliability improvements.

Nice to have

  • Experience with service mesh technologies (Istio, Envoy) and cloud-native networking performance optimization.
  • Exposure to capacity planning, cost optimization, and long-term resource forecasting in cloud environments.
  • Familiarity with incident response processes, post-incident reviews, and reliability improvement practices.
  • Experience contributing to internal platforms, developer tooling, or performance automation frameworks.