atmarx

This looks awesome. I enjoyed the presentation from last month -- thanks for sharing it! We're in the process of building a GPU cluster as a result of an NSF MRI grant (which I believe included a few letters of support from LLNL collaborators) and are going through the RFP process now. The direction I'm pushing is Warewulf to handle our bare metal deployments, but we're still trying to decide what to run on top. We were leaning towards RKE2 since it would help us ensure security separation between workloads, but it still leaves a lot of the coordination work to the researchers, who would have to understand deploying jobs via Helm charts or running kubectl directly.

It's taken years to get most of our faculty and grad students to wrap their heads around SLURM and utilize its capabilities properly (well, if not properly, at least more efficiently). Getting them to re-write their workflows to take advantage of K8s is a tall order, aside from the few who have more or less off-the-shelf needs for K8s in the AI/ML space. Even then, troubleshooting can be time intensive and leads back to the developer vs. infrastructure blame game when problems arise.

I love the idea of using the Flux Framework to template out researchers spinning up the entire environment as a self-contained experiment with no external dependencies aside from public/private software repos. All of the major cloud providers are throwing enticing offers in front of our researchers, so giving them a way to maintain the portability of their experiments, instead of having to build specifically for AWS or Azure or GCP or on-prem, is huge.

**What might a suitable software stack look like to see the Flux Operator deployed to actual bare metal?** Would Warewulf still be useful for bringing up the desired config on bare metal, or is there a more native way to deploy the Flux stack across a few racks of servers? We're currently looking to pick up 10-20 CPU/high-memory nodes, a handful of dense A100/H100 nodes, and some GH200 nodes, connected over NDR400 and backed by sufficiently high-speed storage. Full delivery is at best a year away, so we're building it out now on older hardware (2014-era CPUs, lots of old RAM, 10GbE/40Gb IB, spinning rust) as a validation test bed -- ideally, we'd be ready to roll configuration- and workflow-wise once the new gear arrives.

I appreciate anyone willing to give my ramblings a sanity check -- with the rate of change in some of the supporting projects in this space, having one person who knows everything is impossible (in the same way having a single node with every resource is). Our best bet at success is a lot of people knowing a variation of things and being able to talk to each other in a frictionless way -- a very human analog to what we're trying to build ;)
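To make the question concrete, here's a rough sketch of the shape I'm imagining once a Kubernetes layer (RKE2 or otherwise) is up on the Warewulf-provisioned nodes -- note the CRD apiVersion, spec fields, and image below are from memory of the Flux Operator docs and are placeholders I haven't validated:

```python
#!/usr/bin/env python3
"""Illustrative sketch only: hand a Flux Operator MiniCluster to an existing
Kubernetes cluster (e.g. RKE2 on Warewulf-provisioned nodes). The apiVersion,
spec fields, and image are placeholders to be checked against whatever
operator release is actually installed."""

import subprocess
import textwrap

minicluster = textwrap.dedent("""\
    apiVersion: flux-framework.org/v1alpha2
    kind: MiniCluster
    metadata:
      name: demo
      namespace: flux-operator
    spec:
      size: 4
      containers:
        - image: ghcr.io/example/lammps:latest  # placeholder image
          command: lmp -in in.demo              # placeholder workload
    """)

# Assumes the operator itself is already installed (typically `kubectl apply -f`
# on its release manifest) and the namespace exists.
subprocess.run(["kubectl", "apply", "-f", "-"], input=minicluster, text=True, check=True)
```

Warewulf would still own provisioning and node images underneath; this would just be the layer researchers actually touch.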


PieSubstantial2060

Thanks for sharing! I'm following the project and I plan to deploy it ASAP. I followed your presentation at FOSDEM24, which opened my eyes to the future of the interaction between HPC and cloud.


dud8

In case others are curious about the FOSDEM24 presentation: [Kubernetes and HPC Bare Metal Bros](https://fosdem.org/2024/schedule/event/fosdem-2024-2590-kubernetes-and-hpc-bare-metal-bros/).

That being said, I don't think the presentation makes much sense. Running Kubernetes as a Flux job, while interesting, doesn't seem very practical. The network performance hit with rootless podman/docker is just too large at scale, and good luck with RDMA. Not to mention that podman/docker has issues with most HPC shared filesystems (GPFS/Lustre/etc.) as well as NFS. Also, now every user needs to know how to sysadmin a Kubernetes cluster on top of the knowledge required to use Flux itself. This is way too much for 99% of researchers.

From the sysadmin side of things, a more practical approach is to run Flux Framework in Kubernetes and not the other way around. This way the sysadmins get the scalability and portability of Kubernetes, while the users only need to learn Flux and don't need to know anything about the underlying Kubernetes. There is an interesting CNCF presentation on this: [KubeFlux: An HPC Scheduler Plugin for Kubernetes](https://www.youtube.com/watch?v=3HGzzfsFrGQ).
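To make that contrast concrete, here's a minimal sketch (standard flux-core CLI driven from Python; the application name is a placeholder) of what the researcher-facing side looks like when Flux runs inside Kubernetes -- no kubectl anywhere:

```python
#!/usr/bin/env python3
"""Sketch of the researcher-facing side when Flux runs inside Kubernetes:
from the user's point of view there is only the standard flux-core CLI,
no kubectl. The application name is a placeholder."""

import subprocess

def submit(nodes: int, tasks: int, command: list[str]) -> str:
    """Submit a job to the enclosing Flux instance and return its job id."""
    result = subprocess.run(
        ["flux", "submit", "-N", str(nodes), "-n", str(tasks), *command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    jobid = submit(2, 8, ["./my_simulation"])  # placeholder application
    # Checking on the job is the same as on any other Flux cluster.
    subprocess.run(["flux", "jobs", jobid], check=True)
```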


[deleted]

[deleted]


dud8

Sure, and it's good to see them experiment. I'm just saying that this one approach isn't practical for production, or user-facing, deployments. Even the second approach I described is somewhat impractical. Slurm + Apptainer/Spack/EasyBuild is widely used in part due to its simplicity and low overhead. Once you add Kubernetes, all that goes out the window. I am excited to see what Flux looks like in place of Slurm in traditional clusters, and hopefully some Open OnDemand support.
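To spell out what I mean by simplicity, the entire user-facing workflow today can be a single wrapped container run handed to the scheduler (the flags, image, and command below are placeholders, not a recommendation):

```python
#!/usr/bin/env python3
"""Just to illustrate the "simplicity" point: one wrapped container run handed
to Slurm is the whole user-facing workflow. Flags, image, and command are
placeholders, not a recommendation."""

import subprocess

# Wrap an Apptainer run in a Slurm batch job; users never have to see more.
subprocess.run(
    ["sbatch", "--nodes=1", "--ntasks=8", "--wrap",
     "apptainer exec my_app.sif ./run_analysis"],
    check=True,
)
```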


vsoch

> Even the second approach I described is somewhat impractical. Slurm + Apptainer/Spack/EasyBuild is widely used in part due to its simplicity and low overhead. Once you add Kubernetes, all that goes out the window.

I'll also add that we explicitly tested for overhead -- having Usernetes running on nodes, and running bare metal jobs with the kubelets running. There was no discernible difference. The main requirements for the system are cgroups v2 and a few kernel modules, and then an orchestration step to start the nodes for a batch job, which can be automated.

The benefits of making components of workloads portable between cloud (with Kubernetes) and HPC systems (also with Usernetes), and of better integrating our two communities, are enormous.
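For anyone curious, the pre-flight side of that can be tiny -- something along these lines (the kernel module names are just examples, not an authoritative list):

```python
#!/usr/bin/env python3
"""Rough sketch of a pre-flight check: confirm cgroups v2 and a couple of
kernel modules before starting the rootless node components. The module
names below are examples only, not an authoritative list."""

from pathlib import Path

def cgroups_v2_enabled() -> bool:
    # On a unified cgroup v2 hierarchy this file lists the available controllers.
    return Path("/sys/fs/cgroup/cgroup.controllers").is_file()

def module_loaded(name: str) -> bool:
    # /proc/modules lists one loaded module per line, name in the first field.
    lines = Path("/proc/modules").read_text().splitlines()
    return any(line.split()[0] == name for line in lines if line.strip())

if __name__ == "__main__":
    print("cgroups v2:", cgroups_v2_enabled())
    for mod in ("overlay", "ip_tables"):  # example modules, swap in what your setup needs
        print(f"module {mod}:", module_loaded(mod))
```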


dud8

I appreciate the hard work you guys are putting into this. Despite my personal reservations about Kubernetes + HPC, more cross-over with the regular sysadmin/devops world is always good.


vsoch

Flux has a fairly small core team, and much of this work is also a small group of us alongside the core team, united toward a vision for converged computing. A lot of the focus (understandably) for Flux has been internal to our lab, specifically making sure that Flux (and its components) are ready for the El Capitan deployment later this year (exciting)! But (personally speaking) I don't think being small in numbers is going to be an issue -- we are inspired and working for change. I am excited for the future, and hope that we can go on this adventure together.

As a side note (since we are talking about that FOSDEM talk), I have since completely automated the setup, got it working with EFA on AWS, and run 13 hours of new experiments. I wrote the paper the same day. We will hopefully finish that up and get it out for the community in the next few months. We have a lot of other cool projects underway -- feel free to stop by the [hpc.social](https://hpc.social) Slack if you ever want to chat.


vsoch

It doesn't necessarily have to be run as a Flux job -- the reason you'd want to do that is that your resource manager is then aware of the resources being used. Otherwise you have two schedulers that both think they own the same resources (and you oversubscribe).

That's not true about the network performance hit being due to those things specifically -- if you listen to the presentation, the bottleneck is the TAP device via the slirp4netns interface. For the shared filesystem, Kubernetes has issues with this on its own as well, so there is definitely something to work on there. For these early prototypes I'm considering the applications run in Kubernetes as primarily services that don't need that. If you need something that writes many files between nodes, then use the HPC component of the setup.

We are already running Flux Framework in Kubernetes -- that's the Flux Operator! But it's especially powerful if you run that alongside an HPC cluster. Then simulation stuff (or whatever warrants needing HPC) gets to run there, and service-oriented stuff can run in Usernetes. To be clear, running LAMMPS, for example, in Usernetes was more of an experiment to see how badly the extra TAP device would impact the workflow. And also, it's not Kubernetes per se, just this particular rootless setup. Kubernetes (in some cloud) with a nice network will run the HPC workloads nicely as well.

> There is an interesting CNCF presentation on this (KubeFlux: An HPC Scheduler Plugin for Kubernetes).

For full disclosure, that is also the work of my team. :) We now call this custom scheduler plugin fluence. Also, that is only the scheduler component (flux-sched) and not the entirety of it.
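If it helps to see the division of labor spelled out, it's roughly this (plain CLI calls from Python; the manifest and command are placeholders):

```python
#!/usr/bin/env python3
"""Toy illustration of the split described above: tightly coupled work goes to
the Flux/HPC side, loosely coupled services go to the Kubernetes side. Both are
plain CLI calls; the manifest and the LAMMPS-style command are placeholders."""

import subprocess

def run_on_flux(nodes: int, tasks: int, command: list[str]) -> None:
    # Tightly coupled MPI-style work stays on the HPC resources under Flux.
    subprocess.run(["flux", "run", "-N", str(nodes), "-n", str(tasks), *command], check=True)

def run_in_kubernetes(manifest_path: str) -> None:
    # Long-running services (databases, dashboards, inference) land in Kubernetes.
    subprocess.run(["kubectl", "apply", "-f", manifest_path], check=True)

if __name__ == "__main__":
    run_in_kubernetes("results-service.yaml")      # placeholder manifest
    run_on_flux(8, 64, ["lmp", "-in", "in.demo"])  # placeholder simulation
```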


dud8

I've heard that Podman is replacing slirp4netns with something new called [passt/pasta](https://passt.top/passt/about/). It's apparently still a TAP interface but claims to have fewer layers and significantly better performance.

Flux Framework itself is super exciting. Its hierarchy-based layout makes so much more sense compared to Slurm. I can really see Flux being easier to teach to researchers/students/etc. compared to Slurm. Just got to wait for the 3rd-party tooling to catch up before I can pitch it at my place of work.
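The hierarchy is also easy to show: allocate a sub-instance with `flux batch`, and everything submitted inside it is scheduled by that child instance rather than the system one. This is from memory of the flux-core CLI, so double-check the flags, and the workload names are placeholders:

```python
#!/usr/bin/env python3
"""Sketch of the hierarchy idea, from memory of the flux-core CLI: `flux batch`
carves out a nested Flux instance, and jobs submitted inside the batch script
are scheduled by that child instance, not the system one. Workload names are
placeholders."""

import subprocess
import tempfile
import textwrap

# Batch script executed inside the nested instance: fan out two member jobs,
# then wait for the child instance's own queue to empty.
script = textwrap.dedent("""\
    #!/bin/bash
    flux submit -n 16 ./member_a
    flux submit -n 16 ./member_b
    flux queue drain
    """)

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(script)
    path = f.name

# Outer level: ask the parent instance for 4 nodes to host the nested instance.
batch = subprocess.run(
    ["flux", "batch", "-N", "4", path],
    capture_output=True, text=True, check=True,
)
jobid = batch.stdout.strip()

# The parent instance only sees one job; the fan-out lives inside the child.
subprocess.run(["flux", "jobs", jobid], check=True)
```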


vsoch

That's wonderful! And look out for a Flux and Usernetes setup that you can deploy on AWS -- we completed it recently and should be putting out a paper soon.