Sr. Platform Engineer - AI Agentic

Comcast · Media · West Chester, PA +1

This role focuses on designing, building, and operating cloud infrastructure for AI-enabled and Agentic platforms. The engineer will apply IaC practices, monitor platform health, lead troubleshooting, develop automation, implement observability solutions, and collaborate with AI teams to ensure platforms are operable, observable, and scalable. Foundational AI literacy is required to understand agent lifecycles and non-deterministic execution behaviors. The role also involves using AI-assisted tools for investigation and documentation, and participating in on-call rotations.

What you'd actually do

  1. Design, implement, and operate cloud infrastructure supporting scalable, highly available AI‑enabled and Agentic platforms.
  2. Apply Infrastructure as Code (IaC) practices (e.g., Terraform, Packer, Ansible) to provision and manage cloud resources consistently and securely.
  3. Monitor platform health using metrics, logs, dashboards, and alerts, applying critical thinking to distinguish infrastructure, application, and AI‑driven failures.
  4. Lead troubleshooting and resolution of complex cloud and platform issues, including distributed system and integration failures.
  5. Develop and maintain automation and tooling (primarily Python and shell scripting) to improve reliability, diagnostics, and operational efficiency.

Skills

Required

  • Cloud Infrastructure Engineering (AWS, Azure, or GCP)
  • Infrastructure as Code (Terraform, Packer, Ansible, or equivalent)
  • Python Programming for automation and operational tooling
  • Monitoring, Logging, and Alerting Systems
  • CI/CD Tools and Release Automation
  • Security Fundamentals (IAM, secrets management, encryption, network security)
  • Foundational AI literacy

Nice to have

  • Experience supporting AI, ML, or Agentic platforms in production
  • Familiarity with data and streaming platforms (e.g., S3, SQS, Kafka‑like systems)
  • Understanding of networking and protocol standards
  • Exposure to hardware security modules (HSMs) or advanced key management solutions
  • Experience operating large‑scale distributed systems
  • Cloud cost optimization and performance tuning experience

What the JD emphasized

  • intelligent agents end to end
  • agent logic, workflows, integrations, and decision frameworks
  • deploy those agents into production and own their ongoing behavior, reliability, and impact
  • monitor agent performance and validate AI‑driven actions with human judgment
  • owning both construction and operations
  • AI‑enabled and Agentic platforms
  • AI‑driven failures
  • agent lifecycles, orchestration patterns, and non‑deterministic execution behaviors
  • AI‑assisted tools

Other signals

  • Designing, building, and operating intelligent agents end to end
  • Develop agent logic, workflows, integrations, and decision frameworks
  • Deploy those agents into production and own their ongoing behavior, reliability, and impact
  • Monitor agent performance and validate AI‑driven actions with human judgment
  • Own both construction and operations