Senior Principal Core Infrastructure Engineer

Oracle · Enterprise · India

This role focuses on architecting and leading the design of highly scalable, reliable, and fault-tolerant distributed systems. Key responsibilities include defining scalability requirements, identifying and removing bottlenecks, engineering designs that tolerate in-service updates and unreliable networks, establishing KPIs and telemetry, formally verifying complex features, and leading incident resolution. The role also involves architecting security controls, driving automation through Infrastructure as Code (IaC), and overseeing change management plans for patching and updates. Although the company operates in the enterprise AI domain, this role is core infrastructure engineering and does not involve directly building AI models or products.

What you'd actually do

  1. Lead the architecture and design of interdependent distributed systems, ensuring horizontal and vertical scalability and overall performance, including through the use of distributed state management tools.
  2. Architect fault-tolerant, interdependent systems that withstand in-service updates by implementing advanced redundancy, replication, and automatic failover mechanisms.
  3. Lead the troubleshooting, diagnosis, debugging, and resolution of critical issues in live systems, and mentor others in doing so.
  4. Architect security measures that protect data and applications in multi-tenant environments, including advanced encryption and access controls.
  5. Oversee the development of automation tools and scripts, such as Infrastructure as Code (IaC), for cloud infrastructure management.

Skills

Required

  • Distributed systems architecture
  • System scalability design
  • System reliability engineering
  • Fault tolerance
  • Performance optimization
  • Data plane platforms
  • SLO definition and adherence
  • Telemetry and KPI definition
  • Formal verification (e.g., TLA+)
  • Data replication and synchronization
  • Incident management and troubleshooting
  • Security architecture
  • Infrastructure as Code (IaC)
  • Change management
  • Mentoring

Nice to have

  • Distributed state management tools
  • Load-shedding, throttling, rate-limiting techniques
  • Advanced telemetry systems
  • Cloud infrastructure compliance

What the JD emphasized

  • hyper-scale performance and reliability
  • large-scale operations
  • fault-tolerant designs
  • in-service updates
  • unreliable networks
  • SLO-aligned durability and availability standards
  • critical incident resolution
  • operational readiness
  • comprehensive automation (IaC)
  • safe, automated patching, updates, and rollbacks