Explaining Soperator, Nebius’ open-source Kubernetes operator for Slurm

submitted by
Style Pass
2024-09-27 16:30:01

My name is Mikhail Mokrushin, and I’m the Managed Schedulers Team Leader at Nebius. In this article, I’ll share the details of Soperator, our open-source Kubernetes operator for Slurm. I won’t rehash what Slurm and Kubernetes are, how they differ, or how to use them for computational jobs; all of that was covered in the previous post. Instead, I’ll focus on the ways Slurm and K8s can be combined and on the architectural approach we chose to bring the two systems together. Soperator is available on GitHub under the Apache 2.0 license.

Our solution
Features and reasoning
- Shared root filesystem and how we implemented this
- GPU health checks and how we implemented this
- Easy scaling
- High availability
- Isolation of user actions
- Easy bootstrapping
- Observability

Current limitations
Future plans
Ways to try it
- GitHub repositories
- Deploying to any Kubernetes cluster
- Deploying to Nebius
