Senior Software Engineer - AI Observability - AI, Search & Knowledge Platform

Apple · Big Tech · Cupertino, CA · Software and Services

Senior Software Engineer focused on building AI-enabled observability solutions for Apple's AI, Search, and Knowledge Platform. The role involves designing and developing user-facing observability features for AIML products and infrastructure, leveraging AI/ML for AIOps, and working with LLMs, ML frameworks, and agentic AI concepts. The position emphasizes building scalable, cloud-native distributed systems and microservices, with a focus on anomaly detection, incident detection, and root-cause analysis within AI observability.

What you'd actually do

  1. Design and build AI observability solutions that power Apple Intelligence, Search, and AI infrastructure
  2. Lead the design and development of user-facing observability features for AIML products and infrastructure
  3. Provide technical guidance, share observability best practices and know-how, leverage AI pipelines, and mentor the team
  4. Build and operate large-scale, cloud-native, distributed systems and microservices
  5. Use LLM and ML models for AIOps and model observability

Skills

Required

  • 7+ years of software engineering experience building and operating large-scale, cloud-native, distributed systems and microservices
  • 7+ years of software engineering experience and a strong background in computer science: distributed systems, algorithms and data structures, APIs, and highly scalable, reliable systems and microservices
  • Demonstrated experience using LLM and ML models for AIOps and model observability
  • Hands-on experience building ML pipelines and portable workflows, and tuning models, to deploy ML and LLM models in production for customer-facing features
  • Hands-on experience using LLMs, ML frameworks (e.g. TensorFlow, PyTorch), and libraries like scikit-learn, NumPy, LangChain, MLflow, Kubeflow
  • Experience building services for Observability Analysis, including anomaly detection, incident detection, automated remediation, and root-cause analysis
  • Excellent verbal and written communication, problem-solving, and cross-team collaboration skills

Nice to have

  • Knowledge of current Gen AI research and techniques: MCPs, RAG systems, Agentic AI (multi-agent orchestration, tool calling)
  • Hands-on experience with agentic AI frameworks (e.g. LangGraph, AutoGen, CrewAI) for building multi-step reasoning and tool-using agents
  • Experience designing multi-agent orchestration, tool-calling, or RAG systems for operational/diagnostic workflows
  • Demonstrated proficiency operating workloads on public and/or private cloud platforms, Kubernetes, object storage, networking, databases, and observability services
  • Demonstrated experience building observability systems for metrics, distributed tracing, logs, and profiling
  • Experience with large-scale observability visualization tools like Grafana, Datadog, and ELK
  • Experience building large-scale incident management, alert management, and notification systems
  • Active contributions to CNCF or open source projects (e.g., K8sGPT, HolmesGPT, kagent, OpenTelemetry, Prometheus)

What the JD emphasized

  • AI observability engineer
  • AI-enabled observability solutions
  • AI observability solutions
  • AI observability
  • AIML products and infrastructure
  • AI pipelines
  • LLM and ML models for AIOps
  • model observability
  • ML pipelines
  • model tuning
  • deploy ML and LLM models in production
  • Gen AI research
  • Agentic AI
  • multi-agent orchestration
  • tool calling
  • RAG systems
  • operational/diagnostic workflows

Other signals

  • AI-enabled observability
  • AI infrastructure