Lead Site Reliability Engineer - DataOps

Capital One · Banking · Nottingham, United Kingdom

Lead Site Reliability Engineer (DataOps) responsible for ensuring the health, reliability, and security of critical data pipelines in a hybrid cloud environment. Focuses on production support, data security, automation, performance optimization, and collaboration/leadership within a large-scale data environment. Requires experience with data distribution platforms, scheduling platforms, AWS, scripting, and DevOps/DataOps principles.

What you'd actually do

  1. Act as the subject matter expert and technical lead for resolving the most complex, high-impact incidents affecting data pipelines. Manage multiple stakeholders during critical events. Perform in-depth root cause analysis to prevent recurrence, focusing on data pipelines, scheduling platforms such as Control-M, and AWS-related services.
  2. Ensure the integrity and security of highly sensitive and critical data throughout the entire pipeline. Implement and enforce security best practices, including managing encryption at rest and in transit, access controls, and compliance.
  3. Develop and implement automation for common operational tasks to reduce manual toil. Focus on building tools and monitoring solutions that provide visibility into the end-to-end health of pipelines.
  4. Proactively analyse and tune the performance of batch schedules and AWS resource utilization. Identify and implement optimizations to improve efficiency and reduce operational costs.
  5. Act as a technical leader and mentor for both onsite and offshore team members. Ensure seamless collaboration, clear communication, and consistent operational standards across a distributed team. Contribute to the long-term technical strategy for data operations including modernization efforts.

Skills

Required

  • production support, site reliability, or data operations role within a large-scale data environment
  • data distribution platforms (e.g. Ab Initio and Spark-centric solutions like AWS Glue & EMR)
  • ETL/ELT workflows & integration into data platforms like Snowflake
  • scheduling platforms such as Control-M
  • AWS and its data-related services
  • open-source, cloud-first data-pipeline orchestration capabilities like Apache Airflow
  • Shell scripting & Python for automation and system administration
  • managing highly sensitive and critical data pipelines
  • security and compliance requirements
  • working effectively with both onsite and offshore teams
  • DevOps or DataOps principles and practices

Nice to have

  • IBM Sterling File Gateway or similar managed file transfer (MFT) solutions (e.g. AWS Transfer Family)

What the JD emphasized

  • mission-critical batch data pipelines
  • highly sensitive and critical data streams
  • DevOps or DataOps principles and practices are essential