High Performance Kubernetes (HPK)
By Giorgos Saloustros
Bridging Cloud-Native Workflows and HPC for DaFab
One of the main goals of the DaFab project is to enable seamless, multisite scientific workflows without physically moving data between sites. This is crucial in environments where data transfer is restricted by bandwidth, administrative domains, or security policies. To that end, DaFab orchestrates distributed workflows across multiple HPC and cloud sites, letting data stay in place while computation moves to where it is needed.
These sites come in two flavors: cloud sites (Kubernetes-based) and HPC sites (Slurm-based). Each has its own scheduler, its own policies, and a different level of support for cloud-native tools, which makes it difficult to build workflows that use cloud and HPC resources together. HPK solves this problem.
The Challenge: Cloud vs. HPC
Modern data science and machine learning workflows are increasingly complex, often requiring both the flexibility of cloud-native tools and the raw computational power of high-performance computing (HPC) clusters. Yet, these two worlds have long been separated by technical and operational barriers. Cloud environments rely on Kubernetes for container orchestration, while HPC clusters use job schedulers like Slurm and have their own strict policies and networking setups. This divide makes it hard to move workloads between environments or to build hybrid pipelines that leverage the best of both.
HPK: A Practical Bridge
High Performance Kubernetes (HPK) is designed to break down these barriers. HPK lets you run unmodified Kubernetes workloads (think Spark jobs, Argo Workflows, or distributed PyTorch training) directly on HPC clusters. There is no need to rewrite your code, repackage your containers, or learn a new interface. HPK brings the cloud-native experience to supercomputers and, crucially for DaFab, it enables federated workflow managers to connect to Slurm-based sites using state-of-the-art Kubernetes federation tools.
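To make this concrete, the manifest below is a minimal sketch of an ordinary Kubernetes Pod of the kind HPK can run as-is; the name, image, and command are placeholders, and it would be submitted with the usual tooling (for example, kubectl apply -f) against the Kubernetes API server that HPK brings up inside the HPC site.

```yaml
# A minimal, unmodified Kubernetes Pod spec (name, image, and command are
# placeholders). The same manifest that runs on a cloud cluster is submitted
# unchanged to the Kubernetes instance that HPK deploys on the HPC cluster.
apiVersion: v1
kind: Pod
metadata:
  name: hello-hpc
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: python:3.11
    command: ["python", "-c", "print('hello from the HPC cluster')"]
```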
Key Innovations
HPK introduces several technical innovations to make this possible:
- Kubernetes-in-a-Box: HPK packages all Kubernetes control plane components into a single, portable container. This makes it easy to deploy a “mini cloud” inside any HPC cluster, with no need for root access or complex setup.
- hpk-kubelet: This custom agent acts as a translator, converting Kubernetes pod specs into Slurm jobs and managing their lifecycle using Apptainer containers. It ensures that all resource management aligns with HPC policies (a sketch of this mapping follows the list).
- hpk-pause: A lightweight binary that manages pod networking and lifecycle, ensuring robust integration with the HPC environment.
- Seamless Networking: HPK adapts Kubernetes networking to HPC realities, using Linux network namespaces and CNI plugins like Flannel to provide pod isolation and communication—even on complex HPC fabrics.
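To illustrate the translation performed by the hpk-kubelet, consider the pod below. The manifest itself is ordinary Kubernetes; the Slurm options shown in the comments are the standard equivalents one might expect for such requests, offered here as assumptions rather than HPK's actual output, which is internal to the project.

```yaml
# Illustrative only: an ordinary pod with resource requests, annotated with
# the standard Slurm options a translation like the hpk-kubelet's could map
# them to. The exact mapping is HPK-internal; the comments are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: training-step
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: pytorch/pytorch:latest   # would run under Apptainer on the HPC side
    command: ["python", "train.py"] # placeholder command
    resources:
      requests:
        cpu: "8"        # conceptually, --cpus-per-task=8
        memory: 32Gi    # conceptually, --mem=32G
```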
How Does It Work in Practice?
Let’s say you have a data science pipeline built with Argo Workflows, or a distributed training job using PyTorch and Kubeflow. With HPK, you can deploy these workloads on an HPC cluster just as you would in the cloud. HPK handles the translation behind the scenes, mapping Kubernetes abstractions to HPC resources and ensuring everything runs smoothly.
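For instance, a workflow like the minimal sketch below (names, image, and script are placeholders) is submitted exactly as it would be on a cloud cluster; since every Argo step is a pod, HPK can run each step as a Slurm job under Apptainer.

```yaml
# A minimal Argo Workflow with placeholder names and image. Each step is a
# pod, which HPK translates into a Slurm job on the HPC cluster.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: metadata-pipeline-
spec:
  entrypoint: generate-metadata
  templates:
  - name: generate-metadata
    container:
      image: python:3.11
      command: [python, -c]
      args: ["print('extracting scene metadata')"]
```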
What’s Under the Hood?
HPK’s architecture is designed for maximum compatibility and minimal friction:
- User-Friendly Deployment: All binaries and dependencies are packaged in containers. Users need neither root access nor any modification of the host system.
- Resource Management: HPK defers all resource allocation to the HPC scheduler (like Slurm), ensuring compliance with existing policies and accounting.
- Pod Lifecycle Management: The hpk-kubelet and hpk-pause binaries handle pod creation, monitoring, and cleanup, translating Kubernetes actions into HPC-native operations.
- Networking: HPK uses network namespaces and CNI plugins to provide each pod with an isolated network environment, compatible with HPC networking constraints.
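As a small example of the lifecycle handling described above, the Job below is standard Kubernetes (names, image, and command are placeholders); assuming the usual Job controller ships in HPK's packaged control plane, the pods it creates are launched, monitored, and cleaned up by hpk-kubelet and hpk-pause through Slurm rather than by a node-local kubelet.

```yaml
# Illustrative sketch: a standard Kubernetes Job (placeholder image/command).
# Under HPK, the pods created for this Job are assumed to be executed as
# Slurm-managed Apptainer containers and cleaned up when the Job completes.
apiVersion: batch/v1
kind: Job
metadata:
  name: preprocess-tiles
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: preprocess
        image: python:3.11
        command: ["python", "-c", "print('preprocessing complete')"]
```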
Watch HPK in Action
In the following video, we run the workflow that generates metadata for the field delineation use case in the DaFab project. The demo takes place on a Slurm-based cluster running on Amazon Web Services and shows three main components working together:
- Kubernetes is deployed on the HPC cluster using HPK.
- Argo Workflows orchestrate the pipeline.
- The complete workflow runs end-to-end, combining cloud-native tooling with HPC resources.
Watch the demo here:
Why Does This Matter?
HPK opens up new possibilities for both cloud and HPC users. Cloud-native developers can tap into the immense power of supercomputers without leaving their familiar tools. HPC centers can offer modern, flexible interfaces to their users, attracting new communities and enabling more complex, hybrid workflows. For DaFab, HPK is a key enabler for federated, multisite workflow management—letting you orchestrate jobs across diverse environments without moving data.
Get Involved
HPK is an open, evolving project. If you’re interested in running cloud-native workloads on HPC, or if you want to help shape the future of hybrid computing, we invite you to get involved! Check out the HPK project on GitHub, or reach out to the University of Crete’s Computer Science Department.
Ready to bridge the gap between cloud and HPC? HPK is your on-ramp to the future of scientific and industrial computing.
Giorgos Saloustros
Institute of Computer Science (ICS)
Foundation for Research and Technology - Hellas (FORTH)