What you'd actually do

Lead the development and implementation of components of scalable distributed systems, and begin to architect them, so that they support horizontal and vertical scaling to meet system demands, including by leveraging distributed state management tools.

Build and design fault-tolerant components and systems capable of withstanding in-service updates by implementing redundancy, replication, and automatic failover mechanisms.

Define key performance indicators (KPIs) and telemetry to identify gaps or issues in running systems.

Take a proactive role in diagnosing, debugging, and resolving issues in active components and systems to support ongoing operation, and mentor others in these processes.

Implement robust security measures to protect data and applications in multi-tenant environments, including encryption techniques and access controls.

Skills

Required

Distributed systems design
System architecture
Scalability engineering
Reliability engineering
Performance optimization
Fault tolerance
High-throughput systems
Data plane platforms
Data retrieval and storage
Data processing
Redundancy and replication
Failover mechanisms
Network unreliability handling
Load-shedding
Throttling
Rate-limiting
Service Level Objectives (SLOs)
Key Performance Indicators (KPIs)
Telemetry systems
Dashboards and alerting
Fault injection
Brownout testing
Data replication and synchronization
Incident management
Root cause analysis
Security measures
Encryption
Access controls
Compliance documentation
Infrastructure as Code (IaC)
Automation scripting
Change management
Mentoring

Nice to have

Distributed state management tools
Hyper-scale systems
Elastic scaling
Moderately complex dashboards
Moderately complex test scenarios
Cloud infrastructure compliance
Patching and updates
Rollbacks
Technical oversight
Problem solving
Continuous learning
Industry trends and best practices

Leads development and begins architecting components of scalable, elastic distributed systems. Defines and enforces scalability requirements for owned components; optimizes code and data paths for high‑throughput, hyper‑scale workloads; and leverages data plane platforms for large‑scale retrieval, storage, and processing. Designs fault‑tolerant, in‑service‑upgradable systems using redundancy, replication, failover, and policies for partitions, applying load‑shedding, throttling, and rate‑limiting to handle network unreliability while meeting SLOs. Establishes KPIs and telemetry; builds proactive dashboards and alerts; and designs complex validation (fault injection, brownouts), replication, and synchronization for correctness and durability. Proactively diagnoses and resolves production issues, mentors peers, and ensures operational readiness. Implements robust security controls, executes remediation, maintains compliance documentation, and develops IaC and automation that enable safe patching, updates, and rollbacks within change‑management plans.

Key Responsibilities

System Design & Architecture - System Scalability:
–Lead the development and implementation, and begin to architect, components of scalable distributed systems that support horizontal and vertical scaling to meet system demands, including leveraging distributed state management tools.
–Optimize code and/or systems for large-scale data processing and high-throughput requirements to support hyper-scale systems.
–Define scalability requirements for owned components and ensure design and implementation requirements are met.
–Design systems to scale with elasticity (e.g., effectively scaling both up and down).
–Leverage data plane platforms to effectively handle large-scale data retrieval, storage, and processing.
–Design performance and load testing.

System Design & Architecture - System Reliability Design:
–Build and design fault-tolerant components and systems capable of withstanding in-service updates by implementing redundancy, replication, and automatic failover mechanisms.
–Design systems to effectively handle service disruptions (e.g., network partitions) by prioritizing consistency, availability, or partition tolerance.
–Implement and optimize approaches to handle network unreliability, including load-shedding, throttling, and rate-limiting.
–Design components and systems that are durable and adhere to service level objectives (SLOs), setting expectations for availability and durability of other computing services within the department.

System Design & Architecture - System Reliability Performance:
–Define key performance indicators (KPIs) and telemetry to identify gaps or issues in running systems.
–Build and customize moderately complex dashboards, telemetry systems, and alerting mechanisms to proactively monitor components and system health.

System Design & Architecture - Correctness / Availability:
–Design and implement functional and correctness requirements for feature sets and/or systems in new or existing systems.
–Design complex test scenarios (e.g., fault-injection, brown-out) to evaluate system correctness.
–Implement data replication and synchronization techniques to maintain data integrity and availability.

Operational Troubleshooting & Incident Management:
–Take a proactive role in diagnosing, debugging, and resolving issues in active components and systems to support ongoing operation, and mentor others in these processes.
–Implement strategies to prevent interruptions, ensuring no maintenance windows are required for customers and users when resolving issues.
–Maintain expertise in owned components and systems to ensure effective troubleshooting and performance.
–Meet operational readiness expectations through design and implementation.
–Serve in operational support rotations, providing guidance in incident response and root cause investigations.

Compliance & Security:
–Implement robust security measures to protect data and applications in multi-tenant environments, including encryption techniques and access controls.
–Execute remediation plans to address identified security gaps.
–Ensure cloud infrastructure is in compliance with industry standards and regulations and that documentation is up to date.

Automation & Change Management:
–Develop and maintain automation scripts and tools (e.g., Infrastructure as Code (IaC)) to manage cloud infrastructure.
–Create and adhere to change management plans for patching, updating, and rolling back applications, and begin designing systems and components to allow for automation of these processes.

Core Responsibilities

Planning & Execution:
–Manages and coordinates moderately complex tasks, monitoring timelines and deliverables to ensure timely completion and adherence to requirements for a moderately-sized project or initiative. Efficiently delegates, monitors, and prioritizes work across multiple projects, providing technical oversight and adjusting plans to address shifts in resources or timelines.

Collaboration & Partnership:
–Collaborates across the organization to align on expectations and achieve shared objectives. Leverages understanding of business leaders, stakeholders, and/or customers to ensure proposed solutions meet their needs. Supports inclusivity by actively seeking and listening to diverse perspectives, ensuring others feel heard and respected.

Problem Solving:
–Identifies and addresses moderately complex issues by analyzing a wide range of data and/or information to identify solutions in accordance with standard practices. Proactively escalates unresolved or critical issues with a thorough assessment and suggests potential solutions. Reviews, contributes to, and documents problem solving strategies.

Continuous Learning:
–Pursues learning opportunities to expand knowledge and skills and/or tools in new areas and stays abreast of the latest industry trends and best practices. Proactively seeks and leverages ongoing feedback and training to improve skills. Coaches and mentors junior team members, fostering continuous learning and knowledge sharing within and across teams.

Continuous Improvement:
–Develops ideas, recommends updates, and/or collaborates on the implementation of process improvements to increase the efficiency and effectiveness of processes, protocols, and workflows across teams, and evaluates the impact on key stakeholders. Solicits feedback from others on ideas for alternative approaches and methods for continued improvement.

Performance and Development:
–Contributes to the talent development pipeline by participating in candidate interviews, assessing candidates, and providing hiring recommendations.

Career Level - IC4

Principal Core Infrastructure Engineer
