What you'd actually do

Designs, implements, and optimizes components in distributed systems with an emphasis on scalability, resiliency, and operability.

Delivers features and load/performance tests; leverages data plane platforms and distributed state tools for high-volume retrieval, storage, and processing; and reviews peers’ implementations for scalability compliance.

Builds fault-tolerant paths (redundancy, replication, automatic failover), applies recovery‑oriented principles, and implements retries, circuit breakers, and timeouts.

Proactively detects and mitigates issues via tests, alarms, dashboards, and telemetry; authors runbooks and participates in incident response and RCAs.

Implements standard replication and synchronization, develops automation/IaC for troubleshooting and maintenance, and applies advanced security controls (encryption, access controls, remediation) while ensuring change management, compliance, and documentation standards are met.

Skills

Required

distributed systems design
scalability optimization
resiliency engineering
operability
data plane platforms
distributed state management tools
high-volume data processing
fault tolerance
recovery-oriented computing
retries
circuit breakers
timeouts
testing
alarms
dashboards
telemetry
runbook authoring
incident response
root cause analysis
data replication
synchronization
automation scripting
Infrastructure as Code (IaC)
security controls
encryption
access controls
remediation
change management
compliance
documentation

Nice to have

performance testing
load testing
multi-tenant environments


Key Responsibilities

System Design & Architecture - System Scalability:
– Implements and contributes to the development of distributed system components that support horizontal and vertical scaling, including leveraging distributed state management tools.
– Optimizes code and/or systems for large-scale data processing.
– Implements scalability requirements for assigned components and reviews team members' implementations.
– Leverages components of data plane platforms to handle large-scale data retrieval, storage, and processing.
– Implements performance and load testing.

System Design & Architecture - System Reliability Design:
– Collaborates with the team to build fault-tolerant components capable of withstanding in-service updates by implementing redundancy, replication, and automatic failover mechanisms.
– Applies recovery-oriented computing principles to design components that effectively handle service disruptions.
– Implements retry mechanisms, circuit breakers, and timeouts to help handle network unreliability.

System Design & Architecture - System Reliability Performance:
– Implements tests and alarm configurations to proactively detect and address issues and failures.
– Supports efforts to recover from failures by drafting and executing runbooks and operational procedures.
– Builds and customizes dashboards, telemetry systems, and alerting mechanisms to monitor component health.

System Design & Architecture - Correctness / Availability:
– Designs and implements functional requirements and testing for assigned features within an existing system.
– Implements test scenarios (e.g., fault injection, brown-out) to evaluate system correctness.
– Implements standard data replication and synchronization techniques to maintain data integrity and availability.

Operational Troubleshooting & Incident Management:
– Diagnoses, debugs, and resolves issues in system components to support ongoing operation.
– Implements basic strategies to prevent interruptions, ensuring no maintenance windows are required for customers and users when resolving issues.
– Designs and implements automation scripts and tooling used to troubleshoot operational issues.
– Participates in operational support rotations, assisting in incident responses and root cause investigations.

Compliance & Security:
– Applies advanced security measures to protect data and applications in multi-tenant environments, including encryption and access controls.
– Implements remediation plans to continuously improve security.
– Collaborates with the team to ensure cloud infrastructure complies with relevant industry standards and regulations and that documentation is up to date.

Automation & Change Management:
– Maintains automation scripts and tools (e.g., Infrastructure as Code (IaC)) for managing cloud infrastructure.
– Adheres to change management plans for patching, updating, and rolling back applications.

Core Responsibilities

Planning & Execution:
– Tracks timelines with minimal supervision, ensuring work is completed on time and in alignment with project requirements.
– Prioritizes and adjusts work as resources or timelines change, with some guidance.

Collaboration & Partnership:
– Collaborates across teams to align on expectations and achieve shared objectives. Builds and maintains a comprehensive understanding of business, stakeholder, and/or customer needs to build and support effective partnerships. Actively listens to diverse perspectives and asks questions to ensure understanding of others.

Problem Solving:
– Independently identifies and addresses standard and non-standard issues in accordance with standard practices, escalating more complex issues as appropriate. Analyzes data and/or information from multiple sources to troubleshoot standard and non-standard errors. Contributes to knowledge sharing and best practices.

Continuous Learning:
– Embraces continuous learning by actively seeking to build knowledge and new skills and/or tools, and by staying current with industry trends and best practices. Seeks out and leverages feedback and training to improve skills. Contributes to a culture of continuous learning and knowledge sharing with team members.

Continuous Improvement:
– Develops ideas and recommends updates to increase the efficiency and effectiveness of processes, protocols, and workflows within a team. Seeks input from team members on alternative approaches and methods for improving work.

Career Level - IC3