What you'd actually do

Manages the development and implementation of scalable distributed systems and components across multiple teams, including the effective use of distributed state management tools.

Manages the strategy for building fault-tolerant components and systems capable of withstanding in-service updates by guiding the implementation of redundancy, replication, and automatic failover mechanisms.

Provides oversight in defining key performance indicators (KPIs) and telemetry to identify gaps or issues in running systems.

Guides teams to be proactive when diagnosing, debugging, and resolving issues in active components and systems to support ongoing operation.

Oversees implementation of robust security measures to protect data and applications in multi-tenant environments, ensuring team strategies incorporate encryption techniques and access controls.

Skills

Required

Networking expertise
Automation
Telemetry
Performance troubleshooting
Software engineering
People leadership
System design
Scalability
Reliability
Fault tolerance
Distributed systems
Cloud infrastructure
Security measures
Infrastructure as Code (IaC)
Incident management
Project management

Nice to have

Network Clos fabrics
GNOC
Hardware engineering
Data plane platforms
Load-shedding
Throttling
Rate-limiting
Service level objectives (SLOs)
Dashboards
Alerting mechanisms
Encryption techniques
Access controls
Remediation plans
Compliance with industry standards and regulations
Change management plans
Patching
Updating
Rolling back applications

What the JD emphasized

large-scale OCI network fabrics

automation of Network Clos fabrics

telemetry

performance troubleshooting

software engineering experience

global cloud scale

distributed infrastructure

hyper-scale systems

fault-tolerant components

service disruptions

network unreliability

key performance indicators (KPIs)

telemetry systems

alerting mechanisms

Infrastructure as Code (IaC)

cloud infrastructure compliance

As a Senior Manager, you will lead a team responsible for the development, operation, and improvement of large-scale OCI network fabrics and supporting systems. This role requires deep networking expertise, especially in automation of Network Clos fabrics, telemetry, and performance troubleshooting, combined with software engineering experience. You will build and improve tools, automation, monitoring, and operational systems that make these fabrics more reliable, observable, and efficient at global cloud scale. You will work closely with Network Availability, Network Monitoring, GNOC, hardware engineering, and service teams to resolve complex customer escalations, improve operational readiness, and drive engineering programs that increase performance and availability. The ideal candidate brings both hands-on technical depth and strong people leadership, with experience managing engineers who operate and build software for large-scale distributed infrastructure.

System Design & Architecture – System Scalability:

Manages the development and implementation of scalable distributed systems and components across multiple teams, including the effective use of distributed state management tools.
Oversees code and/or system optimization efforts for large-scale data processing and high-throughput requirements within and across teams to support hyper-scale systems.
Guides teams to define scalability requirements for owned components and ensures design and implementation requirements are met.
Manages the use of data plane platforms to effectively handle large-scale data retrieval, storage, and processing.
Ensures the team designs accurate performance and load tests.

System Design & Architecture – System Reliability Design:

Manages the strategy for building fault-tolerant components and systems capable of withstanding in-service updates by guiding the implementation of redundancy, replication, and automatic failover mechanisms.
Develops design strategies for systems to effectively handle service disruptions (e.g., network partitions) by prioritizing consistency, availability, or partition tolerance.
Leads implementation and optimization initiatives across teams for approaches to handle network unreliability, including load-shedding, throttling, and rate-limiting.
Guides teams to design components and systems that are durable and adhere to service level objectives (SLOs), setting expectations for availability and durability of other computing services within the department.

System Design & Architecture – System Reliability Performance:

Provides oversight in defining key performance indicators (KPIs) and telemetry to identify gaps or issues in running systems.
Oversees the building and customization of moderately complex dashboards, telemetry systems, and alerting mechanisms to proactively monitor components and system health.

System Design & Architecture – Correctness / Availability:

Oversees the design and implementation of functional and correctness requirements for feature sets and/or systems in new or existing systems.
Guides teams to design complex test scenarios (e.g., fault-injection, brown-out) to evaluate system correctness.
Directs implementation strategies for data replication and synchronization techniques to maintain data integrity and availability.

Operational Troubleshooting & Incident Management:

Guides teams to be proactive when diagnosing, debugging, and resolving issues in active components and systems to support ongoing operation.
Ensures teams apply their expertise to prevent interruptions, so that resolving issues requires no maintenance windows for customers and users.
Oversees operational readiness protocols and ensures teams remain knowledgeable of owned components and systems to support effective troubleshooting and performance.
Oversees and approves schedules for operational support rotations.

Compliance & Security:

Oversees implementation of robust security measures to protect data and applications in multi-tenant environments, ensuring team strategies incorporate encryption techniques and access controls.
Directs execution of remediation plans to address identified security gaps, promoting continuous improvement of security measures.
Ensures comprehensive documentation and cloud infrastructure compliance with industry standards and regulations.

Automation & Change Management:

Oversees the development and maintenance of automation scripts and tools (e.g., Infrastructure as Code (IaC)) to manage cloud infrastructure.
Works with teams to create and adhere to change management plans for patching, updating, and rolling back applications, and guides development of components to allow for automation of these processes.

Core Responsibilities

Planning & Execution:

Manages multiple medium- to large-scale projects or initiatives across teams, ensuring timelines, deliverables, and budgets (when applicable) are monitored and met.
Provides direction to teams on project work, setting priorities, and aligning with business needs.
Guides teams on adjusting plans to accommodate resource or timeline changes.

Collaboration & Partnership:

Drives cross-functional partnerships to align on expectations and shared objectives across multiple teams.
Coaches team members to develop strategic relationships with business leaders, stakeholders, and external partners to foster collaboration and long-term success.
Promotes inclusivity by actively seeking and listening to diverse perspectives, ensuring others feel heard and respected.

Problem Solving:

Provides direction to multiple teams on addressing complex operational and/or technical issues, as well as guidance on analyzing complex data and/or information to identify solutions.
Reviews and provides insights into unresolved or critical issues, helping teams to identify potential solutions.

Continuous Learning:

Models engaging in continuous learning to deepen expertise and stay ahead of industry trends, integrating best practices into strategic planning.
Leverages feedback to drive personal and team skill improvements.
Identifies skill gaps across teams and empowers team members to pursue learning and knowledge-sharing opportunities that build their expertise in new areas, coaching them to apply learnings to advance the organization.

Continuous Improvement:

Drives and oversees teams as they collaborate on, develop, and implement ideas to increase the efficiency and effectiveness of processes, protocols, and workflows within and across teams.
Guides teams to adopt new ideas for alternative approaches and methods and encourages feedback for continued improvement.

Performance and Development:

Drives performance across teams by providing feedback and coaching in alignment with performance management processes, guidelines, and expectations.
Discusses development goals with team members, shares opportunities to facilitate career development, and ensures individual goals are aligned with broader organizational goals.
Develops and manages the talent acquisition pipeline by leading candidate interviews, monitoring promotion eligibility, and/or orchestrating talent resources.

Disclaimer:

Certain U.S.-based or U.S. customer- or client-facing roles may be required to comply with applicable requirements, such as immunization/occupational health mandates, and/or drug testing requirements.

Range and benefit information provided in this posting is specific to the stated locations only.

US: Hiring Range in USD from: $146,300 to $306,400 per annum. May be eligible for bonus, equity, and compensation deferral.

Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business. Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.

Oracle US offers a comprehensive benefits package which includes the following:

Medical, dental, and vision insurance, including expert medical opinion
Short-term disability and long-term disability
Life insurance and AD&D
Supplemental life insurance (Employee/Spouse/Child)
Health care and dependent care Flexible Spending Accounts
Pre-tax commuter and parking benefits
401(k) Savings and Investment Plan with company match
Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
11 paid holidays
Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
Paid parental leave
Adoption assistance
Employee Stock Purchase Plan
Financial planning and group legal
Voluntary benefits including auto, homeowner and pet insurance

The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.

Career Level - M3