Senior iCloud Efficiency Engineer (GenAI & Agentic Systems)

Apple · Big Tech · Culver City +2 · Software and Services

Senior engineer to lead GenAI and agentic systems that improve the efficiency of large-scale cloud infrastructure operations. The role focuses on building practical AI systems for engineering workflows, capacity planning, anomaly detection, and operational safety, using LLMs, RAG, and automation frameworks.

What you'd actually do

  1. Design and implement GenAI-powered solutions to improve infrastructure efficiency, operational workflows, and engineering productivity across iCloud services
  2. Build and deploy agentic systems using technologies such as Claude, LLM orchestration frameworks, skills-based execution models, and intelligent automation pipelines
  3. Develop AI-assisted workflows for capacity planning, anomaly detection, forecasting, deployment validation, and operational safety
  4. Partner with SRE, infrastructure engineering, platform teams, and finance to identify high-value efficiency opportunities and convert them into scalable AI solutions
  5. Create reliable and safe agent workflows with strong observability, guardrails, human-in-the-loop validation, and operational controls

Skills

Required

  • Python, Java, or similar languages
  • GenAI technologies
  • LLM application architecture
  • retrieval
  • context engineering
  • tool use
  • workflow orchestration
  • agentic workflows
  • evaluation
  • observability
  • failure handling
  • cloud infrastructure operations
  • deployment systems
  • operational safety principles
  • analytical skills
  • systems thinking
  • cross-functional communication skills

Nice to have

  • Experience applying GenAI to infrastructure operations, SRE workflows, capacity planning, or engineering productivity systems
  • Experience building AI systems with operational guardrails, governance models, and safe deployment patterns for enterprise environments
  • capacity forecasting
  • cost optimization
  • infrastructure efficiency modeling at hyperscale
  • private cloud environments
  • large-scale storage systems
  • global distributed infrastructure

What the JD emphasized

  • production GenAI, ML platform, developer productivity, infrastructure automation, or tooling systems
  • GenAI technologies, LLM application architecture, including retrieval, context engineering, tool use, workflow orchestration, agentic workflows, evaluation, observability, and failure handling
  • operational guardrails, governance models, and safe deployment patterns for enterprise environments

Other signals

  • Apply state-of-the-art LLM systems in production
  • Build and deploy agentic systems
  • Develop AI-assisted workflows
  • Drive experimentation and adoption of AI-first engineering practices