Principal Engineer - AI Platform (AI Factory / Advance AI)

Target · Retail · NCD-0375 Brooklyn Park, MN

Principal Engineer role focused on building and scaling an enterprise ML Operations Platform at Target. The role involves defining architecture, setting standards for ML lifecycle management, deployment, governance, and observability, and enabling hundreds of engineers and data scientists to deliver AI solutions efficiently and safely. Key responsibilities include designing cloud-native platforms, leading architecture for inference, driving adoption of Kubernetes and modern platform engineering, and mentoring senior engineers.

What you'd actually do

  1. Define the long-term technical strategy and architecture for the enterprise ML Operations Platform.
  2. Design scalable, secure, and resilient cloud-native platforms supporting machine learning workloads.
  3. Establish best practices for model development, deployment, monitoring, and lifecycle management.
  4. Lead architecture for enterprise machine learning infrastructure supporting batch, streaming, and real-time inference.
  5. Drive adoption of cloud-native technologies, Kubernetes, and modern platform engineering practices.

Skills

Required

  • MS in Computer Science, Engineering, Mathematics, or related technical field with relevant software engineering experience
  • Extensive experience designing and delivering large-scale cloud-native platforms or distributed systems
  • Deep experience building and operating enterprise machine learning platforms and MLOps capabilities
  • Strong understanding of machine learning lifecycle management, deployment strategies, observability, and production operations
  • Demonstrated experience with machine learning platforms and tooling such as Vertex AI, Kubeflow, MLflow, and/or equivalent technologies
  • Experience building developer platforms or internal platform products
  • Experience with distributed training, GPU infrastructure, and large-scale inference platforms
  • Experience with feature management, model governance, and responsible AI practices
  • Familiarity with Generative AI platforms and infrastructure supporting foundation model workloads
  • Experience with Terraform, GitOps, service mesh technologies, and platform automation
  • Experience mentoring senior engineers and leading enterprise-scale modernization initiatives
  • Expertise designing Kubernetes-based platforms supporting AI and machine learning workloads
  • Strong understanding of software engineering best practices including CI/CD, infrastructure as code, observability, testing, and automation
  • Experience defining technical strategy, architectural standards, and engineering best practices across multiple teams
  • Excellent communication and influencing skills with the ability to communicate complex technical concepts to engineering and business leaders

What the JD emphasized

  • define the architecture and evolution of our enterprise machine learning platform
  • establish scalable patterns for developing, deploying, monitoring, and governing machine learning systems
  • establish enterprise-wide standards for machine learning lifecycle management, deployment, governance, observability, and operational excellence
  • partnering with cross-functional engineering teams to modernize AI platform capabilities
  • influence architecture across multiple organizations
  • guide strategic platform investments
  • define the long-term technical strategy and architecture for the enterprise ML Operations Platform
  • Establish best practices for model development, deployment, monitoring, and lifecycle management
  • Lead architecture for enterprise machine learning infrastructure supporting batch, streaming, and real-time inference
  • Define standards for model governance, observability, reliability, explainability, and responsible AI
  • Experience building developer platforms or internal platform products
  • Experience with feature management, model governance, and responsible AI practices

Other signals

  • enterprise ML platform
  • MLOps
  • deploy, govern, and operate ML solutions at scale
  • accelerate AI adoption