On-premise Kubernetes for Machine Learning — A couple of Kubelet tips
When creating an on-premise Kubernetes cluster, one trades the ease of a mostly pre-configured cloud-hosted solution for the power (and responsibility) of a more complex setup that can be adapted to specific on-premise requirements. This demands relatively deep know-how about networking, storage and compute, but it also gives the freedom to tweak and optimize the cluster for specific use cases, such as Machine Learning.
There are a number of options for bootstrapping an on-premise Kubernetes cluster for Machine Learning, two of which are very commonly used:
Kubespray — a collection of Ansible playbooks, which enable provisioning and configuration management for a production ready cluster
Kubeadm — a utility to create a minimum viable Kubernetes cluster
Both of these approaches install and configure the basic Kubernetes control plane components:
- Controller Manager
- API Server
As well as the component running on every node, which controls the container runtime:
- Kubelet
This final component (Kubelet) has a few configuration options that can be particularly beneficial for a Machine Learning setup. Here is why, and how to configure them:
1. ML docker images can unexpectedly fill up disk space
Consider ML docker images running a framework of your choice (like Tensorflow or Pytorch), the CUDA toolkit, additional python libraries, etc. Such images can easily grow to more than 10 GB, at times even more than 15 GB. This might not be a problem when the storage pool at hand is large enough. However, on a smaller disk where docker images share space with other processes (like an OS disk), the disk can fill up from multiple sources before garbage collection (GC) properly kicks in, causing unexpected failures. A simple countermeasure is to move docker images off the OS disk onto a separate disk, where the Kubelet is fully in control and can engage GC without other processes also writing to that disk.
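If Docker is the container runtime, one way to achieve this separation is the data-root option in /etc/docker/daemon.json, pointing at a mount on the dedicated disk (the path below is just an example):

```json
{
  "data-root": "/mnt/docker-data"
}
```

After changing this option, the existing contents of the old data directory need to be copied over and the Docker daemon restarted.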
In case a separate disk is not available, the Kubelet can also be configured to perform GC at a custom-defined threshold. On a Kubernetes cluster you can configure the Kubelet via the KUBELET_EXTRA_ARGS environment variable.
This variable is sourced from:
- /etc/default/kubelet (on DEB-based distros)
- /etc/sysconfig/kubelet (on RPM-based distros)
The options to configure GC are --image-gc-low-threshold and --image-gc-high-threshold, which define the % of disk usage to which image GC attempts to free (default 80%) and the % of disk usage which triggers image GC (default 85%), respectively. Garbage collection deletes least recently used images until the low threshold has been met.
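To see what these percentages mean in absolute terms, here is a small shell calculation for a hypothetical 100 GB image disk (the disk size and the 70/80 values are example numbers, not recommendations):

```shell
# Translate GC thresholds into absolute disk usage.
# The 100 GB disk size is a made-up example value.
DISK_GB=100
LOW_THRESHOLD=70    # --image-gc-low-threshold
HIGH_THRESHOLD=80   # --image-gc-high-threshold

TRIGGER_GB=$((DISK_GB * HIGH_THRESHOLD / 100))
TARGET_GB=$((DISK_GB * LOW_THRESHOLD / 100))

echo "Image GC kicks in at ${TRIGGER_GB} GB used and deletes images until usage drops to ${TARGET_GB} GB"
```

On such a disk, GC would start at 80 GB used and free images until usage falls to 70 GB.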
To configure the target disk usage to 70%, with GC kicking in at 80%, add the following content to the Kubelet configuration file on the nodes:
KUBELET_EXTRA_ARGS="--image-gc-low-threshold 70 --image-gc-high-threshold 80"
After that, the configuration needs to be reloaded and the Kubelet process restarted for the changes to take effect:
systemctl daemon-reload && systemctl restart kubelet
2. ML docker image pulls can be prematurely interrupted
Following the Dockerfile best practices, you are most likely using multi-stage builds and minimizing the number of layers. The latter can mean that when you pull the image from your repository, it has very few layers, or even just a single very large one. This can be problematic for the Kubelet, which is tuned by default for “small” microservices: docker images of “ML size” are outliers, and it is generally expected that, even on a relatively slow internet connection, at least one layer can be fully pulled within a few minutes. A single “ML-sized” layer can take much longer than that to download, during which time the Kubelet receives no progress updates, so it eventually marks the pull as failed.
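To get a feeling for the timescales involved, a quick back-of-the-envelope calculation (the layer size and bandwidth below are assumed example values):

```shell
# Estimate how long a single large layer takes to download.
# Both numbers are illustrative assumptions, not measurements.
LAYER_MB=12000      # a single 12 GB image layer
BANDWIDTH_MBS=10    # sustained download speed in MB/s

MINUTES=$((LAYER_MB / BANDWIDTH_MBS / 60))

echo "A ${LAYER_MB} MB layer at ${BANDWIDTH_MBS} MB/s takes roughly ${MINUTES} minutes to pull"
```

Twenty minutes without a single completed layer is far longer than the Kubelet is willing to wait by default.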
The Kubelet warning message when a pull of a big image fails is also not very informative:
Failed to pull image "private-registry/big-ml-image:latest": rpc error: code = Canceled desc = context canceled
This can be avoided with some additional Kubelet configuration. By default, the Kubelet cancels a docker image pull if no progress is made within 1 minute (a partially downloaded layer does not count as progress; only fully downloaded layers do). This can be changed with the --image-pull-progress-deadline option.
The option is set in the kubelet configuration file with a duration, e.g. in minutes (which should of course be adjusted to the available internet bandwidth):
Combining both GC and pull progress deadline configurations would look like this:
KUBELET_EXTRA_ARGS="--image-gc-low-threshold 70 --image-gc-high-threshold 80 --image-pull-progress-deadline 30m"
With this additional configuration, two very common Kubernetes problems regarding ML workloads can be avoided.
These steps are not the only way to configure the Kubelet. It is also possible to use tools like Kubeadm & Kubespray to pre-configure the Kubelet during the bootstrap process. The documentation of both tools explains how to accomplish this and also offers further options for Kubelet adjustments. You can also take a look at the Kubelet API reference for detailed information on its various other configuration options.