Agentic AI / AI Ops Engineer – Platform Engineering

Caterpillar · Industrial · Brisbane, Queensland

This role focuses on building and deploying production-grade Agentic AI and AI Ops solutions to enable intelligent, automated, and reliable platform operations. It involves designing and implementing agentic AI systems for multi-step workflows, building AI-driven automation for operations, and integrating AI solutions with platform infrastructure.

What you'd actually do

  1. Design and implement agentic AI systems that plan, reason, and execute multi-step workflows across platform and portal ecosystems
  2. Build and deploy AI-driven automation for operations (incident detection, triage, remediation, and monitoring)
  3. Develop and productionize LLM-based applications and intelligent workflows integrated with APIs, tools, and enterprise systems
  4. Integrate AI solutions with platform infrastructure (Kubernetes, CI/CD, observability, telemetry pipelines)
  5. Establish scalable patterns for the AI lifecycle (design → deployment → monitoring → optimization)

Skills

Required

  • Experience building and deploying AI/ML or Generative AI solutions in production
  • Strong software engineering fundamentals (system design, CI/CD, testing, monitoring)
  • Experience with cloud-native and distributed systems

Nice to have

  • Experience building agentic or LLM-based systems (e.g., multi-step workflows, tool integration, memory/context handling)
  • Strong programming skills (Python or similar)
  • Kubernetes and modern platform infrastructure
  • Observability (logs, metrics, traces) and telemetry systems
  • Workflow/orchestration frameworks (e.g., LangGraph, AutoGen, or similar)
  • Understanding of AI Ops, SRE practices, and reliability engineering principles
  • Experience productionizing AI systems with a focus on scalability, performance, and reliability

What the JD emphasized

  • production-grade AI systems
  • autonomous workflows
  • agentic AI systems
  • AI-driven automation
  • LLM-based applications
  • platform infrastructure
  • AI lifecycle
  • AI/ML or Generative AI solutions in production
  • agentic or LLM-based systems
  • AI Ops, SRE practices, and reliability engineering principles
  • scalability, performance, and reliability