Principal AI Ops Architect, GPS

Scale AI · Data AI · London, United Kingdom · GPS Engineering

This role focuses on designing and developing the production lifecycle of full-stack AI applications for international government partners, ensuring end-to-end system reliability, inference observability, data orchestration, security, and resilient cloud infrastructure. It involves owning production outcomes, scaling feedback loops, navigating global compliance, leading incident response, and driving product evolution based on field learnings.

What you'd actually do

  1. Own the production outcome: Take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies.
  2. Ensure full-stack integrity: Oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment.
  3. Scale the feedback loop: Build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability.
  4. Navigate global compliance: Manage the technical lifecycle within diverse regulatory frameworks.
  5. Incident command: Lead the response for production issues in mission-critical environments, ensuring rapid resolution and building the guardrails to prevent them from happening again.

Skills

Required

  • 6+ years in a high-impact technical role (SRE, FDE or MLOps)
  • experience in the public sector
  • familiarity with international government security standards and the complexities of deploying sovereign AI
  • proven experience maintaining production-grade applications, with a deep understanding of the full request lifecycle, connecting frontend/API layers to the backend and AI core
  • proficiency in coding and modern AI infrastructure
  • ownership mentality
  • understanding of public sector reliability needs
  • ability to explain technical performance to high-ranking officials

Nice to have

  • Kubernetes
  • vector databases
  • agentic development
  • LLM observability tools

What the JD emphasized

  • public sector
  • international government agencies
  • global compliance
  • sovereign AI
  • mission-critical environments
  • reliability

Other signals

  • production lifecycle of full-stack AI applications
  • end-to-end system reliability
  • real-time inference observability
  • sovereign data orchestration
  • high-security software integration
  • resilient cloud infrastructure
  • international government partners
  • AI applications that deliver real impact