Sr. Engineer II, EPICS, NG-SIEM (Hybrid)

CrowdStrike · Enterprise · Austin, TX

CrowdStrike is seeking a Sr. Engineer II for its NG-SIEM EPICS team to own the reliability and scalability of its large-scale SIEM platform. The role involves building observability, automation, and scaling solutions for a complex distributed system that processes massive amounts of security data. Responsibilities include end-to-end observability design, coordinated scaling engineering, incident response, capacity planning, cost management, and automation of operational procedures. The ideal candidate has extensive experience in software, SRE, or platform engineering on large-scale distributed systems, proficiency in a systems language and a scripting language, and deep experience with observability and with streaming platforms such as Kafka.

What you'd actually do

  1. End-to-end observability: Design, build, and maintain monitoring and synthetic test suites that provide deep visibility into the health of the entire NG-SIEM pipeline — from ingest through search and workflow execution — enabling rapid root cause analysis across component boundaries.
  2. Coordinated scaling: Engineer orchestrated scaling solutions that treat the NG-SIEM pipeline as a unified system, proportionally increasing resources across all dependent components (Kafka, ingest pipelines, downstream services) to eliminate cascading bottleneck patterns.
  3. Incident response engineering: Serve as a subject matter expert during platform-wide incidents (P2 and above), applying cross-service knowledge to diagnose and resolve multi-component failures. Participate in follow-the-sun on-call rotations, providing incident commander coordination for critical platform-wide events.
  4. Capacity planning and cost management: Build and refine models for end-to-end capacity forecasting that account for all pipeline dimensions, including partner team dependencies (data services, GPS). Develop tooling to continuously track and surface cost drivers across the platform.
  5. Automation and runbooks: Transform manual standard operating procedures into automated remediation workflows — including pipeline-wide scaling responses, CID rebalancing, and infrastructure healing — with the goal of resolving issues before customers are impacted.

Skills

Required

  • 10+ years of experience in software engineering, site reliability engineering, or platform engineering
  • Significant time spent working on large-scale distributed systems
  • Ability to make pragmatic tradeoffs between short-term delivery needs and long-term platform goals
  • Strong proficiency in at least one systems programming language (Go, Java, Rust, or C++)
  • Strong proficiency in one scripting language (Python, Bash)
  • Deep experience with end-to-end observability — building monitoring pipelines, defining SLIs/SLOs, and creating dashboards that drive actionable insights across multi-service architectures
  • Demonstrated ability to diagnose and resolve complex incidents spanning multiple distributed components operating 24/7
  • Experience with coordinated capacity planning and scaling for systems with significant infrastructure footprints
  • Hands-on experience with streaming platforms (Kafka or similar) and understanding of back pressure, partition management, and consumer group dynamics at scale
  • Familiarity with infrastructure-as-code, CI/CD pipelines, and automated deployment practices

Nice to have

  • Experience

What the JD emphasized

  • own the reliability and scalability
  • security industry's largest SIEM platform
  • end-to-end health
  • deep cross-service expertise
  • engineer who builds the observability, automation, and scaling systems
  • keep the entire platform performing
  • end-to-end observability
  • multi-service architectures
  • diagnose and resolve complex incidents spanning multiple distributed components
  • coordinated capacity planning and scaling
  • significant infrastructure footprints
  • streaming platforms (Kafka or similar)
  • back pressure, partition management, and consumer group dynamics at scale