Staff Engineer, High Performance Computing

Pfizer · Pharma · New York, NY

Pfizer is seeking an experienced Staff Engineer to lead the technical architecture of its cloud High-Performance Computing (HPC) platform, which supports computational workloads in drug discovery and development. The role involves establishing the go-forward computing technologies for the cloud HPC platform, implementing robust engineering practices, championing infrastructure as code (IaC), and configuring core services for HPC at scale in cloud environments (AWS/GCP). Responsibilities include designing and owning high-throughput, parallel, low-latency infrastructure for HPC and ML/AI workloads, recommending cutting-edge HPC technologies, and ensuring performance, reliability, scalability, cost efficiency, and security. The role also covers automation, DevOps, and monitoring and reliability strategies for the infrastructure.

What you'd actually do

  1. Serve as a primary technical expert: evaluate and advocate for the go-forward technology platforms and toolkits used for HPC service delivery, and drive consensus on them among senior managers and engineers.
  2. Design and own robust and dependable high-throughput, parallel, low-latency infrastructure for HPC and ML/AI workloads in multiple cloud environments (AWS/GCP).
  3. Drive adoption of infrastructure automation using IaC tools like Terraform and CloudFormation.
  4. Determine KPIs to guide monitoring, logging, and alerting strategies for the infrastructure.
  5. Collaborate with stakeholders, users, and leaders to develop a long-term technical roadmap for cloud-based HPC services.

Skills

Required

  • B.S. in computer science, life science, data science, or a similar field, with 6+ years of experience in cloud infrastructure engineering.
  • A proven track record of developing and supporting robust HPC frameworks in a cloud environment.
  • Expert-level experience with AWS, GCP, or both, including knowledge of core compute and storage services relevant to HPC.
  • Deep understanding of modern CI/CD practices and of observability and monitoring for cloud-based HPC infrastructure.
  • Strong knowledge of distributed systems and production system reliability.
  • Familiarity with monitoring and observability frameworks (CloudWatch, Prometheus, Grafana, etc.).
  • Solid understanding of cloud networking, identity, security controls, and core services.

Nice to have

  • M.S. in computer science, life science, data science, or a similar field.
  • 10-15 years of experience in HPC/cloud engineering.
  • Expertise with distributed computing environments, especially EKS/GKE/Kubernetes.
  • Deep experience with HPC environments, job schedulers, and NVIDIA GPU compute.
  • Prior experience with HPC deployment utilities, including AWS ParallelCluster, AWS Parallel Computing Service (PCS), and Google Cloud Cluster Toolkit.
  • Familiarity with other aspects of managing HPC services in a cloud environment: cloud financial models, cost optimization, user support services, application delivery, Linux administration, job scheduling, and resource optimization.

What the JD emphasized

  • cloud HPC platform
  • HPC workloads
  • ML/AI workloads

Other signals

  • HPC platform computing technologies