Senior HPC Software Engineer

Ford · Auto · United States · Enterprise Technology

Seeking a senior technical contributor to support, modernize, and scale an on-premises high-performance computing (HPC) platform. The role involves Linux systems administration, HPC operations, Kubernetes, automation, observability, software tooling, and user-facing platform delivery, with a focus on improving reliability and usability. Responsibilities include developing and maintaining core HPC services, supporting AI/ML workloads, and creating tooling and integrations using Python or Go.

What you'd actually do

  1. Administer, troubleshoot, and improve RHEL-based high-performance computing environments supporting CPU and GPU workloads.
  2. Create and maintain HPC services across compute, storage, networking, scheduling, Kubernetes, and observability.
  3. Develop tools, scripts, APIs, integrations, and automation using Python, Go, Bash, or similar languages.
  4. Apply software engineering best practices, including Git workflows, code reviews, testing, modular design, and CI/CD.
  5. Support and help update HPC scheduling environments, with Slurm experience preferred.

Skills

Required

  • RHEL-based systems administration
  • HPC operations
  • Kubernetes
  • Automation
  • Observability
  • Software tooling
  • Python
  • Go
  • Bash
  • Git workflows
  • Code reviews
  • Testing practices
  • CI/CD pipelines
  • Maintainable code design
  • Troubleshooting skills
  • Documentation
  • Communication skills

Nice to have

  • Slurm experience
  • Grafana
  • Prometheus
  • Dynatrace

What the JD emphasized

  • 10+ years of experience
  • Strong Linux systems administration experience
  • Experience with Slurm, PBS, or another HPC workload manager
  • Hands-on experience with scripting and software development using Python, Go, Bash, or similar languages
  • Strong troubleshooting skills