Manager, SRE Risk Advisory and Oversight

Capital One · Banking · McLean, VA +1

This role focuses on second-line oversight and risk assessment of Site Reliability Engineering (SRE) and software engineering practices within a financial services company. The Manager will evaluate cloud implementations, resilience architectures, and the integration of emerging automation and Generative AI toolsets to ensure reliability and stability. Key responsibilities include conducting technical risk analyses, supporting effective challenge against enterprise risk appetites, building reporting materials for executives, and assessing core SRE pillars like SLIs/SLOs and CI/CD pipelines. While the role evaluates AI/ML tooling, its core function is risk management and advisory, not direct AI/ML model development or deployment.

What you'd actually do

  1. Perform Deep-Dive Risk Analysis: Conduct independent, technical risk assessments of cloud infrastructure architectures, software delivery lifecycles, and observability frameworks to identify systemic resilience and stability risks.
  2. Support Effective Challenge: Evaluate first-line cloud engineering practices against enterprise risk appetites, ensuring robust strategies are maintained for automation, system resiliency, performance, and monitoring.
  3. Build Storytelling & Reporting Materials: Partner with team leadership (Sr. Managers and Directors) to translate complex, highly technical engineering data into structured risk reports, presentation decks, and executive storytelling materials.
  4. SRE Subject Matter Expertise: Serve as a trusted technical analyst on core SRE pillars, assessing the design and maturity of Service Level Indicators/Objectives (SLIs/SLOs), error budgets, release pipelines (CI/CD), and toil reduction efforts.
  5. Evaluate AI & Tech Integration: Actively evaluate the integration of cutting-edge technologies—specifically cloud-native stacks, containerization, and the application of emerging Gen AI/ML tooling within software delivery—to ensure reliable operational boundaries.

Skills

Required

  • Bachelor's Degree or military experience
  • At least 4 years of experience in Technology Management, Software Engineering, Site Reliability Engineering, or Cyber Risk Management
  • At least 2 years of experience with cloud implementations (AWS, GCP, or Azure)
  • At least 1 year of experience with open-source programming languages

Nice to have

  • Master's Degree in Computer Science, Computer Engineering, or a relevant technical discipline.
  • Professional cloud or infrastructure certification (AWS Certified Solutions Architect, AWS SysOps Administrator).
  • Experience analyzing or utilizing enterprise monitoring, observability, and alerting toolsets (Splunk, Prometheus, Datadog, ELK, PagerDuty).
  • Demonstrated understanding of cloud-native systems, containerization stacks (Kubernetes), and CI/CD pipelines.
  • Proven experience drafting technical assessments or presentation materials used to communicate technical findings to senior leadership.
  • Strong communication and interpersonal skills, with the ability to influence and drive technical alignment across stakeholder groups.
  • Prior experience working within financial services or another highly regulated industry.

What the JD emphasized

  • second-line oversight
  • effective challenge
  • technical risk analyses
  • emerging automation capabilities (including Generative AI toolsets)
  • independent tests of our security and technology risk
  • independent, technical risk assessments
  • Evaluate first-line cloud engineering practices against enterprise risk appetites
  • Actively evaluate the integration of cutting-edge technologies
  • emerging Gen AI/ML tooling
  • Formulate Risk Recommendations
  • highly-regulated industry