Sr. Engineer II - EPICS, NG-SIEM (Hybrid)

CrowdStrike · Enterprise · London, Ireland, United Kingdom

CrowdStrike is seeking a Sr. Engineer II for its NG-SIEM EPICS team to own the reliability and scalability of its large-scale SIEM platform. The role involves building observability, automation, and scaling systems that keep the entire pipeline healthy and performant, from ingest through search and workflow execution. The engineer will also take part in incident response, capacity planning, cost management, and cross-team collaboration to improve platform resilience and efficiency.

What you'd actually do

  1. Design, build, and maintain monitoring and synthetic test suites that provide deep visibility into the health of the entire NG-SIEM pipeline — from ingest through search and workflow execution — enabling rapid root cause analysis across component boundaries.
  2. Engineer orchestrated scaling solutions that treat the NG-SIEM pipeline as a unified system, proportionally increasing resources across all dependent components (Kafka, ingest pipelines, downstream services) to eliminate cascading bottleneck patterns.
  3. Serve as a subject matter expert during platform-wide incidents (P2 and above), applying cross-service knowledge to diagnose and resolve multi-component failures. Participate in follow-the-sun on-call rotations, acting as incident commander for critical platform-wide events.
  4. Build and refine models for end-to-end capacity forecasting that account for all pipeline dimensions, including partner team dependencies (data services, GPS). Develop tooling to continuously track and surface cost drivers across the platform.
  5. Transform manual standard operating procedures into automated remediation workflows — including pipeline-wide scaling responses, CID rebalancing, and infrastructure healing — with the goal of resolving issues before customers are impacted.

Skills

Required

  • software engineering
  • site reliability engineering
  • platform engineering
  • large-scale distributed systems
  • systems programming language (Go, Java, Rust, or C++)
  • scripting language (Python, Bash)
  • end-to-end observability
  • monitoring pipelines
  • SLIs/SLOs
  • dashboards
  • complex incident diagnosis and resolution
  • coordinated capacity planning
  • scaling
  • streaming platforms (Kafka or similar)
  • backpressure
  • partition management
  • consumer group dynamics
  • infrastructure-as-code
  • CI/CD pipelines
  • automated deployment practices
  • written and verbal communication skills

Nice to have

  • incident commander coordination

What the JD emphasized

  • own the reliability and scalability
  • security industry's largest SIEM platform
  • treating these as software engineering problems rather than purely operational ones
  • deep cross-service expertise and coordinated action
  • engineer who builds the observability, automation, and scaling systems that keep the entire platform performing
  • high-ownership technical leaders
  • mission: to stop breaches
  • 10+ years of experience in software engineering, site reliability engineering, or platform engineering
  • significant time spent on large-scale distributed systems
  • ability to make pragmatic tradeoffs between short-term delivery needs and long-term platform goals
  • deep experience with end-to-end observability
  • building monitoring pipelines, defining SLIs/SLOs, and creating dashboards that drive actionable insights across multi-service architectures
  • demonstrated ability to diagnose and resolve complex incidents spanning multiple distributed components operating 24/7
  • experience with coordinated capacity planning and scaling for systems with significant infrastructure footprints
  • hands-on experience with streaming platforms (Kafka or similar) and understanding of backpressure, partition management, and consumer group dynamics at scale