What you'd actually do

Lead the design, development, and operation of cloud-scale observability platforms supporting metrics, logs, traces, and related telemetry data.

Architect and implement highly scalable, resilient, and cost-efficient telemetry collection, ingestion, processing, storage, and query systems.

Drive the evolution of end-to-end observability pipelines, from instrumentation and data collection through real-time analytics and long-term retention.

Design and optimize distributed systems capable of ingesting and processing massive volumes of telemetry data with stringent latency and availability requirements.

Develop scalable storage and indexing solutions for high-cardinality metrics, large-scale log analytics, and distributed tracing workloads.

Skills

Required

design, development, and operation of cloud-scale observability platforms
metrics, logs, traces, and related telemetry data
highly scalable, resilient, and cost-efficient telemetry collection, ingestion, processing, storage, and query systems
end-to-end observability pipelines
distributed systems
high-throughput telemetry ingestion
large-scale data processing
cost-efficient storage
low-latency query execution
multi-tenant reliability
operational excellence
cloud-native observability platforms
massive volumes of telemetry data
stringent latency and availability requirements
scalable storage and indexing solutions
high-cardinality metrics
large-scale log analytics
distributed tracing workloads
fast, reliable, and intuitive access to observability data
performance bottlenecks
reliability, fault tolerance, scalability, security, and operational excellence
hyperscale cloud environments
technical strategy and architectural decisions
mentoring senior and junior engineers
technical leadership
engineering best practices
collaboration with product management, architects, SREs, and engineering teams
troubleshooting and root-cause analysis
emerging trends, technologies, and best practices in observability, distributed systems, data processing, and cloud-native architectures

Nice to have

AI/ML experience
experience with AI/ML model training, serving, or evaluation

What the JD emphasized

cloud-scale observability platforms

massive scale

distributed systems

high-throughput telemetry ingestion

large-scale data processing

cost-efficient storage

low-latency query execution

multi-tenant reliability

operational excellence

cloud-native observability platforms

massive volumes of telemetry data

stringent latency and availability requirements

scalable storage and indexing solutions

high-cardinality metrics

large-scale log analytics

distributed tracing workloads

fast, reliable, and intuitive access to observability data

performance bottlenecks

reliability, fault tolerance, scalability, security, and operational excellence

hyperscale cloud environments

Join the team building Oracle Cloud Infrastructure's state-of-the-art observability platform, powering visibility and operational intelligence for both OCI's internal cloud services and customers running mission-critical workloads on OCI. OCI Monitoring and Logging serve as foundational platforms used by OCI engineering teams to operate and troubleshoot hundreds of cloud services while also enabling customers to monitor, analyze, and gain insights into their own applications and infrastructure. This unique position offers the opportunity to build observability solutions that operate at massive scale, serving the demanding needs of OCI's own services as well as a global customer base. Our team tackles some of the industry's most challenging distributed systems problems, including high-throughput telemetry ingestion, large-scale data processing, cost-efficient storage, low-latency query execution, multi-tenant reliability, and operational excellence. If you are passionate about building cloud-native observability platforms that power both the cloud itself and the customers who depend on it, we'd love to talk to you.

Lead the design, development, and operation of cloud-scale observability platforms supporting metrics, logs, traces, and related telemetry data.
Architect and implement highly scalable, resilient, and cost-efficient telemetry collection, ingestion, processing, storage, and query systems.
Drive the evolution of end-to-end observability pipelines, from instrumentation and data collection through real-time analytics and long-term retention.
Design and optimize distributed systems capable of ingesting and processing massive volumes of telemetry data with stringent latency and availability requirements.
Develop scalable storage and indexing solutions for high-cardinality metrics, large-scale log analytics, and distributed tracing workloads.
Build and enhance query, search, and retrieval services that deliver fast, reliable, and intuitive access to observability data.
Collaborate with product management, architects, SREs, and engineering teams to define and deliver next-generation observability capabilities.
Identify and resolve performance bottlenecks across the observability stack, including ingestion, storage, indexing, aggregation, and query execution.
Design systems with a strong focus on reliability, fault tolerance, scalability, security, and operational excellence.
Drive technical strategy and architectural decisions for observability services operating in hyperscale cloud environments.
Mentor senior and junior engineers, provide technical leadership, and foster engineering best practices across the organization.
Partner with service teams to improve instrumentation, telemetry quality, and operational visibility across cloud services.
Establish and monitor key service health, scalability, performance, and cost-efficiency metrics for observability platforms.
Lead troubleshooting and root-cause analysis efforts for complex distributed systems and large-scale production environments.
Stay current with emerging trends, technologies, and best practices in observability, distributed systems, data processing, and cloud-native architectures.

Disclaimer:

Certain U.S.-based or U.S. customer or client-facing roles may be required to comply with applicable requirements, such as immunization/occupational health mandates, and/or drug testing requirements.

Range and benefit information provided in this posting is specific to the stated locations only.

US: Hiring Range in USD from: $135,200 to $306,400 per annum. May be eligible for bonus, equity, and compensation deferral.

Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business. Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.

Oracle US offers a comprehensive benefits package which includes the following:

Medical, dental, and vision insurance, including expert medical opinion
Short-term disability and long-term disability
Life insurance and AD&D
Supplemental life insurance (Employee/Spouse/Child)
Health care and dependent care Flexible Spending Accounts
Pre-tax commuter and parking benefits
401(k) Savings and Investment Plan with company match
Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
11 paid holidays
Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
Paid parental leave
Adoption assistance
Employee Stock Purchase Plan
Financial planning and group legal
Voluntary benefits including auto, homeowner and pet insurance

The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.

Career Level - IC5
