Sr Software Engineer - AI, Search & Knowledge Platform – Cloud Infrastructure

Apple · Big Tech · Cupertino, CA · Machine Learning and AI

This role focuses on building and scaling cloud-native ML infrastructure, specifically platforms for ML training and inference. It involves designing and implementing agentic workflows, Kubernetes-based control planes, and MCP-based infrastructure servers to manage these systems at large scale. The role emphasizes open-source contributions and expertise in Kubernetes and Crossplane.

What you'd actually do

  1. Architect and develop cloud-native, agentic infrastructure platforms supporting ML training, inference, and large-scale distributed systems.
  2. Lead and mentor engineers building Crossplane-based control planes, Kubernetes operators, and ArgoCD-driven GitOps automation.
  3. Design, implement, and optimize MCP-based infrastructure servers that contextualize and manage infrastructure and application state across environments.
  4. Contribute to CNCF open-source projects and represent Apple in the cloud-native community.
  5. Implement observability, governance, and automation frameworks to ensure performance, reliability, security, and compliance.

Skills

Required

  • Kubernetes
  • Crossplane
  • Golang
  • Python
  • agentic workflows
  • cloud-native ML infrastructure
  • ML training
  • ML inference
  • distributed systems
  • Kubernetes internals
  • controller-runtime
  • Crossplane composition frameworks
  • ArgoCD
  • Helm
  • Infrastructure-as-Code
  • GitOps
  • performance tuning
  • GPU optimization
  • cost efficiency
  • performance profiling
  • system-level debugging

Nice to have

  • open-source contributions
  • CNCF projects
  • Model Context Protocol (MCP)
  • AIOps
  • LLM-driven automation
  • OpenTelemetry
  • Prometheus
  • Grafana
  • model registries

What the JD emphasized

  • deep expertise in Kubernetes, Crossplane, Golang/Python, and agentic workflows
  • ML training and inference at massive scale
  • architect systems that are declarative, self-managing, and highly performant
  • agentic, event-driven workflows, Crossplane compositions, and self-healing control planes
  • highly automated and observable infrastructure
  • ML engineering, SRE, and platform teams
  • intelligent infrastructure
  • ML training and inference pipelines
  • Crossplane-based control planes, Kubernetes operators, and ArgoCD-driven GitOps automation
  • MCP-based infrastructure servers
  • observability, governance, and automation frameworks
  • agentic orchestration workflows
  • GitOps, Infrastructure-as-Code, and Kubernetes cluster lifecycle automation
  • resilient, cost-efficient, and optimized for performance
  • Kubernetes internals, controller-runtime, and Crossplane composition frameworks
  • ArgoCD, Helm, and IaC (Terraform or Crossplane)
  • GitOps and reconciliation-driven workflows
  • design and operate infrastructure for ML training and inference
  • performance tuning and GPU optimization
  • leading technical teams and driving architectural decisions
  • cost efficiency, performance profiling, and system-level debugging
  • Contributions to CNCF open-source projects (Kubernetes, Crossplane, ArgoCD, Envoy, Prometheus, etc.)
  • Kubernetes API machinery, CRDs, and control plane development
  • Model Context Protocol (MCP) or contextual infrastructure servers
  • AIOps or agentic/LLM-driven automation in production environments
  • observability and distributed tracing (OpenTelemetry, Prometheus, Grafana)
  • building ML infrastructure platforms (training clusters, inference systems, model registries)

Other signals

  • ML training and inference at massive scale
  • cloud-native ML infrastructure
  • agentic workflows