Senior Manager, Core Infrastructure Engineering

Oracle · Enterprise · Austin, TX +1

This Senior Manager role leads a team responsible for the development, operation, and improvement of large-scale OCI network fabrics and supporting systems. It requires deep networking expertise along with experience in automation, telemetry, performance troubleshooting, and software engineering. The role involves building and improving tools, automation, monitoring, and operational systems for reliability, observability, and efficiency at global cloud scale, and collaborating with various teams to resolve escalations, improve operational readiness, and drive engineering programs. Candidates need hands-on technical depth and people leadership experience managing engineers who operate and build software for large-scale distributed infrastructure.

What you'd actually do

  1. Manage the development and implementation of scalable distributed systems and components across multiple teams, including the effective use of distributed state management tools.
  2. Set the strategy for building fault-tolerant components and systems that can withstand in-service updates, guiding the implementation of redundancy, replication, and automatic failover mechanisms.
  3. Oversee the definition of key performance indicators (KPIs) and telemetry used to identify gaps or issues in running systems.
  4. Guide teams to diagnose, debug, and resolve issues in active components and systems proactively to support ongoing operation.
  5. Oversee the implementation of robust security measures that protect data and applications in multi-tenant environments, ensuring team strategies incorporate encryption techniques and access controls.

Skills

Required

  • Networking expertise
  • Automation
  • Telemetry
  • Performance troubleshooting
  • Software engineering
  • People leadership
  • System design
  • Scalability
  • Reliability
  • Fault tolerance
  • Distributed systems
  • Cloud infrastructure
  • Security measures
  • Infrastructure as Code (IaC)
  • Incident management
  • Project management

Nice to have

  • Network Clos fabrics
  • GNOC
  • hardware engineering
  • data plane platforms
  • load-shedding
  • throttling
  • rate-limiting
  • service level objectives (SLOs)
  • dashboards
  • alerting mechanisms
  • encryption techniques
  • access controls
  • remediation plans
  • compliance with industry standards and regulations
  • change management plans
  • patching
  • updating
  • rolling back applications

What the JD emphasized

  • large-scale OCI network fabrics
  • automation of Network Clos fabrics
  • telemetry
  • performance troubleshooting
  • software engineering experience
  • global cloud scale
  • distributed infrastructure
  • hyper-scale systems
  • fault-tolerant components
  • service disruptions
  • network unreliability
  • key performance indicators (KPIs)
  • telemetry systems
  • alerting mechanisms
  • Infrastructure as Code (IaC)
  • cloud infrastructure compliance