Principal Core Infrastructure Engineer

Oracle · Enterprise · India

This role focuses on leading the development and architecture of scalable, elastic distributed systems. Responsibilities include optimizing code and data paths for high-throughput, hyper-scale workloads, designing fault-tolerant and in-service-upgradable systems, establishing KPIs and telemetry, and implementing robust security and compliance measures. The role also involves developing infrastructure as code (IaC) and automation for safe system updates and operational readiness.

What you'd actually do

  1. Lead the development and implementation of components of scalable distributed systems, and begin to architect them, so that they scale horizontally and vertically to meet system demands, including through the use of distributed state management tools.
  2. Design and build fault-tolerant components and systems that withstand in-service updates through redundancy, replication, and automatic failover mechanisms.
  3. Define key performance indicators (KPIs) and telemetry to identify gaps or issues in running systems.
  4. Take a proactive role in diagnosing, debugging, and resolving issues in active components and systems to support ongoing operation, and mentor others in these processes.
  5. Implement robust security measures to protect data and applications in multi-tenant environments, including encryption techniques and access controls.

Skills

Required

  • Distributed systems design
  • System architecture
  • Scalability engineering
  • Reliability engineering
  • Performance optimization
  • Fault tolerance
  • High-throughput systems
  • Data plane platforms
  • Data retrieval and storage
  • Data processing
  • Redundancy and replication
  • Failover mechanisms
  • Network unreliability handling
  • Load-shedding
  • Throttling
  • Rate-limiting
  • Service Level Objectives (SLOs)
  • Key Performance Indicators (KPIs)
  • Telemetry systems
  • Dashboards and alerting
  • Fault injection
  • Brownout testing
  • Data replication and synchronization
  • Incident management
  • Root cause analysis
  • Security measures
  • Encryption
  • Access controls
  • Compliance documentation
  • Infrastructure as Code (IaC)
  • Automation scripting
  • Change management
  • Mentoring

Nice to have

  • Distributed state management tools
  • Hyper-scale systems
  • Elastic scaling
  • Moderately complex dashboards
  • Moderately complex test scenarios
  • Cloud infrastructure compliance
  • Patching and updates
  • Rollbacks
  • Technical oversight
  • Problem solving
  • Continuous learning
  • Industry trends and best practices

What the JD emphasized

  • scalability requirements
  • fault-tolerant
  • SLOs
  • security controls
  • compliance documentation
  • automation