Director, AI Alignment and Interpretability (remote)

CrowdStrike · Enterprise · United States · Remote

Lead alignment and interpretability research for security-domain AI systems, focusing on understanding model internals, detecting misuse, and developing methods for training interventions, behavioral constraints, and evaluation protocols. This role involves hands-on research leadership in a novel problem space.

What you'd actually do

  1. Own the alignment and interpretability research agenda for security-domain AI. Set priorities, personally lead the hardest open problems, and develop methods that explain model behavior mechanistically: not just what models do, but why, and what that implies at the edges of their training distribution.
  2. Build and apply techniques for detecting offensive-misuse signal in model internals, including probing for latent representations of vulnerability knowledge, circuit analysis to understand how security-relevant capabilities are encoded, and activation analysis to surface risk that behavioral testing alone would miss. Work closely with the adversarial evaluation team to close the loop between what they find in testing and what you find in the weights.
  3. Develop alignment methodology for security-domain AI and own the evaluation framework that makes it measurable. This includes behavioral constraints, training interventions grounded in interpretability findings, deployment guardrails, and the benchmarks and tests that demonstrate, rather than merely assert, that models operate within intended bounds.
  4. Contribute original research through publications and external engagement. Interpretability for security-specialized models is understudied. Publishing this work is part of the job.
  5. Recruit, develop, and retain a lean team of research scientists. Set the technical bar through your own contributions, not just your expectations.

Skills

Required

  • MS or PhD in machine learning, computer science, or a related field, with research depth in interpretability, AI alignment, or a closely adjacent area.
  • 8+ years in ML research or engineering, with direct experience doing interpretability or alignment research on large language models.
  • Hands-on expertise with mechanistic interpretability methods (probing classifiers, circuit analysis, activation patching, causal tracing, feature visualization) applied to real models.
  • Experience designing and running alignment evaluations: behavioral testing, capability elicitation, red-teaming, or similar methodologies rigorous enough to support meaningful safety claims.
  • Track record of leading and growing researchers while remaining an active technical contributor yourself.

Nice to have

  • Background in offensive security, vulnerability research, or adversarial ML, with enough depth to recognize what you find in model internals and reason about misuse potential.
  • Published research in mechanistic interpretability, AI alignment, or AI safety.
  • Experience applying interpretability methods to domain-specialized or fine-tuned models, not only general-purpose foundation models.
  • Familiarity with alignment challenges specific to models with dual-use capability: systems that understand and can reason about offensive techniques, and what that means for responsible deployment.
  • History of working closely with adversarial evaluation or red teams, using behavioral findings to motivate internal analysis and vice versa.

What the JD emphasized

  • alignment and interpretability research
  • security-domain AI
  • mechanistic interpretability
  • AI alignment
  • AI safety
  • offensive security
  • vulnerability research
  • adversarial ML
  • dual-use capability
