AI Applications Ops Lead, GPS

Scale AI · Data AI · London, United Kingdom · GPS Engineering

This role focuses on the production lifecycle and operational reliability of full-stack AI applications for international government partners, emphasizing system integrity, inference observability, security, and compliance within regulated environments. It involves scaling feedback loops, incident command, and translating technical metrics for stakeholders.

What you'd actually do

  1. Own the production outcome: Take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies.
  2. Ensure full-stack integrity: Oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment.
  3. Scale the feedback loop: Build automated systems to monitor model performance and data drift across geographically dispersed environments, maintaining the required levels of reliability.
  4. Navigate global compliance: Manage the technical lifecycle within diverse regulatory frameworks.
  5. Incident command: Lead the response for production issues in mission-critical environments, ensuring rapid resolution and building the guardrails to prevent them from happening again.

Skills

Required

  • 6+ years in a high-impact technical role (SRE, FDE or MLOps)
  • experience in the public sector
  • familiarity with international government security standards
  • proficiency in coding and modern AI infrastructure
  • Kubernetes
  • vector databases
  • agentic development
  • LLM observability tools
  • ability to explain to a high-ranking official why the performance of the system has degraded and how it is being fixed

Nice to have

  • deploying sovereign AI

What the JD emphasized

  • public sector
  • sovereign AI
  • regulatory frameworks
  • mission-critical environments
  • public safety or privacy

Other signals

  • production lifecycle of full-stack AI applications
  • end-to-end system reliability
  • real-time inference observability
  • sovereign data orchestration
  • high-security software integration
  • resilient cloud infrastructure