Senior Principal Core Infrastructure En… at Oracle

What you'd actually do

Leads the architecture and design of interdependent distributed systems, ensuring horizontal and vertical scalability and overall performance, including by leveraging distributed state management tools.

Architects comprehensive fault-tolerant interdependent systems capable of withstanding in-service updates by implementing advanced redundancy, replication, and automatic failover mechanisms.

Leads efforts and mentors others in troubleshooting, diagnosing, debugging, and resolving critical issues in active systems to support ongoing operation.

Architects comprehensive security measures to protect data and applications in multi-tenant environments, including advanced encryption and access controls.

Oversees the development of comprehensive automation tools and scripts (e.g., Infrastructure as Code (IaC)) for cloud infrastructure management.

Skills

Required

Distributed systems architecture
System scalability design
System reliability engineering
Fault tolerance
Performance optimization
Data plane platforms
SLO definition and adherence
Telemetry and KPI definition
Formal verification (e.g., TLA+)
Data replication and synchronization
Incident management and troubleshooting
Security architecture
Infrastructure as Code (IaC)
Change management
Mentoring

Nice to have

Distributed state management tools
Load-shedding, throttling, rate-limiting techniques
Advanced telemetry systems
Cloud infrastructure compliance

Architects and leads design of interdependent, elastic distributed systems for hyper‑scale performance and reliability. Defines scalability requirements with stakeholders, identifies and removes bottlenecks, and leverages data plane platforms for large‑scale operations. Engineers fault‑tolerant designs that sustain in‑service updates, handle partitions and unreliable networks (load‑shedding, throttling, rate‑limiting), and set SLO‑aligned durability and availability standards. Establishes KPIs and advanced telemetry, formally verifies complex features, and defines replication/synchronization strategies for integrity. Leads critical incident resolution and operational readiness, holding partners to SOPs while mentoring others. Architects advanced security controls and oversees remediation, and drives comprehensive automation (IaC) and change plans enabling safe, automated patching, updates, and rollbacks.

Key Responsibilities

System Design & Architecture - System Scalability:

- Leads the architecture and design of interdependent distributed systems, ensuring horizontal and vertical scalability and overall performance, including by leveraging distributed state management tools.
- Identifies performance and scalability bottlenecks and optimizes code and/or systems for large-scale data processing and high-throughput requirements to improve the performance of hyper-scale systems.
- Collaborates with stakeholders to define system scalability requirements, ensuring the defined requirements meet customer expectations.
- Designs interdependent systems to scale with elasticity (e.g., effectively scaling both up and down).
- Leverages and implements data plane platforms for large-scale data operations.
- Evaluates whether systems are meeting nonfunctional scalability requirements, and proactively anticipates system failures to meet the requirements.

System Design & Architecture - System Reliability Design:

- Architects comprehensive fault-tolerant interdependent systems capable of withstanding in-service updates by implementing advanced redundancy, replication, and automatic failover mechanisms.
- Leads the design of systems that effectively handle service disruptions (e.g., handling network partitions by prioritizing consistency, availability, or partition tolerance).
- Designs and optimizes advanced techniques for handling network unreliability, including load-shedding, throttling, and rate-limiting.
- Designs systems that are durable and adhere to service level objectives (SLOs), driving standards for availability and durability of other computing services within the department.

System Design & Architecture - System Reliability Performance:

- Defines key performance indicators (KPIs) and telemetry to proactively identify risks, gaps, or cyclical dependencies in running systems.
- Leads the creation and customization of complex dashboards, telemetry systems, and alerting mechanisms to proactively monitor and ensure optimal system health.

System Design & Architecture - Correctness / Availability:

- Evaluates whether systems are meeting functional and correctness requirements, and identifies improvement opportunities.
- Formally verifies complex features (e.g., via TLA+) to ensure system design correctness.
- Develops strategies for data replication and synchronization to maintain data integrity and availability.

Operational Troubleshooting & Incident Management:

- Leads efforts and mentors others in troubleshooting, diagnosing, debugging, and resolving critical issues in active systems to support ongoing operation.
- Implements advanced strategies to prevent interruptions, ensuring no maintenance windows are required for customers and users when resolving issues.
- Maintains expertise in dependencies and owned components and systems to ensure effective troubleshooting and performance.
- Reviews and approves operational readiness and standard operating procedures, and holds internal partners accountable for meeting those standards.
- Provides guidance on coordinating operational support rotations, offers expert guidance in incident response, and conducts comprehensive root cause investigations.

Compliance & Security:

- Architects comprehensive security measures to protect data and applications in multi-tenant environments, including advanced encryption and access controls.
- Develops and oversees the execution of remediation plans to address identified security vulnerabilities.
- Mentors others to ensure that cloud infrastructure is in compliance with industry standards and regulations and that documentation is up to date.

Automation & Change Management:

- Oversees the development of comprehensive automation tools and scripts (e.g., Infrastructure as Code (IaC)) for cloud infrastructure management.
- Creates change management plans for patching, updating, and rolling back applications, and designs systems to allow for automation of these processes.

Core Responsibilities

Planning & Execution:

- Oversees and tracks timelines and/or budgets for large-scale projects or initiatives to ensure timely progress and adherence to requirements. Strategically balances multiple projects and adjusts plans to accommodate shifts in resources or schedules, mitigating risks to project outcomes.

Collaboration & Partnership:

- Fosters collaboration across the line of business and with external stakeholders to ensure alignment of expectations and strategic objectives. Builds and maintains partnerships with business leaders, stakeholders, and/or customers to address barriers and contribute to organizational success. Drives transparency and inclusivity by actively seeking, listening to, and leveraging diverse perspectives.

Problem Solving:

- Develops and refines problem-solving strategies and serves as an escalation point for complex issues across multiple projects or teams. Leads the analysis of complex data and/or information to identify patterns and root causes, reviewing recommendations for resolution, and implementing solutions that prevent future issues.

Continuous Learning:

- Builds expertise within one's area and actively pursues learning opportunities to stay current with the latest industry trends and best practices. Acts as a role model for continuous learning by identifying new areas to grow skills. Applies new knowledge to drive advancement and mentors others to do the same, fostering a culture of continuous learning and knowledge sharing.

Continuous Improvement:

- Develops and leads efforts to implement ideas that increase the efficiency and effectiveness of processes, protocols, and workflows across teams, as well as evaluates the impact on key stakeholders. Actively encourages team to recommend ideas for improvement and provide feedback on approaches and methods for continued improvement.

Performance and Development:

- Leverages subject matter expertise to sustain the talent development pipeline by participating in candidate interviews, assessing candidates, and providing hiring recommendations.

Basic Qualifications

BS or MS degree in Computer Science or relevant technical field involving coding or equivalent practical experience
10+ years of total experience in software development
Demonstrated ability to write great code using Java, GoLang, C#, or similar OO languages
Proven ability to deliver products and experience with the full software development lifecycle
Experience working on large-scale, highly distributed services infrastructure
Experience working in an operational environment with mission-critical tier-one livesite servicing
Systematic problem-solving approach, strong communication skills, a sense of ownership, and drive
Experience designing architectures that demonstrate deep technical depth in one area, or span many products, to enable high availability, scalability, market-leading features and flexibility to meet future business demands

Preferred Qualifications

Experience as a technical lead on a large-scale cloud service
Hands-on experience developing and maintaining services on a public cloud platform (e.g., AWS, Azure, Oracle)
Experience working on Kubernetes
Knowledge of Infrastructure as Code (IaC) languages, preferably Terraform
Strong knowledge of databases (SQL and NoSQL)
Strong knowledge of Computer Networking (OSI layers, HTTP, DNS, TCP/IP, DHCP, Routers, Gateways, Subnets, etc.)
Knowledge of Linux internals, Linux/Unix troubleshooting skills
Familiarity with host virtualization technologies (KVM, Containers, Docker, etc.)
Able to effectively communicate technical ideas verbally and in writing (technical proposals, design specs, architecture diagrams and presentations)
Experience with hiring, mentorship and raising the talent bar across the organization

Career Level - IC4.5