This post has been superseded by this tutorial, which no longer requires any code changes. Please read that instead.
I'm leaving this here for historical reasons only.
I needed GPU support in kind, so I added it. I’m also prone to yak shaving so it’s quick, dirty and not going upstream.
When developing tools for Kubernetes I like to use kind, which runs a whole cluster inside a single Docker container. I especially like using it via pytest-kind, which makes running Python unit tests against a Kubernetes cluster a breeze.
Today, as of kind 0.17.0, there is no support for passing GPUs through to the Kubernetes cluster, and an attempt made in kubernetes-sigs/kind#1886 was rejected. It seems there is a desire to add this support to kind in the future, but disagreement on how to implement it. Sadly I don't have time to dive into that and implement a robust solution that would be accepted by the kind maintainers, so I decided to quickly hack together a version that I could use right away.
You can find my fork of kind here with a Pull Request that adds GPU support.
My PR adds a gpus config option to kind nodes, which passes the --gpus=all flag to Docker. So all you need to use it is the NVIDIA drivers, Docker and the NVIDIA container runtime. If you can run this quick test you're all set.
$ docker run --rm --gpus=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Installing my fork
If you have golang installed you can pull down my fork and build yourself a kind binary with GPU support. Alternatively, if you're on 64-bit Linux you can grab this binary I already built.
$ git clone https://github.com/jacobtomlinson/kind.git
$ cd kind
$ git checkout gpu
$ make install
$ kind version
kind (@jacobtomlinson's patched GPU edition) v0.18.0-alpha.69+a32cb054c819a1 go1.19.3 linux/amd64
Creating a GPU kind cluster
Create a kind cluster template YAML and specify gpus: true in the control-plane node's config.
# kind-gpu.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: gpu-test
nodes:
- role: control-plane
  gpus: true
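Since the option is set per node, you could in principle attach the GPUs to a worker node instead. Here's a sketch of what that would look like; I've only tested the single control-plane case above, so treat this variant as untested.
# kind-gpu-worker.yaml (untested variant)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: gpu-test
nodes:
- role: control-plane
- role: worker
  gpus: true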
Then create the cluster.
$ kind create cluster --config kind-gpu.yaml
I like to use the awesome kubectx command to switch my context over to my new kind cluster.
$ kubectx kind-gpu-test
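If you don't have kubectx installed, plain kubectl does the same job; kind names the context kind-<cluster name>, so kind-gpu-test here.
$ kubectl config use-context kind-gpu-test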
Installing NVIDIA plugins
In order to schedule GPUs in our cluster we need the NVIDIA plugins installed. We can do this via the NVIDIA GPU Operator, so let's install that with Helm.
As our host machine already has the NVIDIA drivers installed, we need to disable the operator's driver install step.
$ helm install --repo https://nvidia.github.io/gpu-operator gpu-operator \
--wait --generate-name \
--create-namespace -n gpu-operator \
--set driver.enabled=false
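The operator's pods can take a couple of minutes to come up. A quick way to check everything landed is to list the operator namespace and confirm the node now advertises an nvidia.com/gpu resource; the node name below assumes kind's usual <cluster name>-control-plane naming.
$ kubectl get pods -n gpu-operator
$ kubectl describe node gpu-test-control-plane | grep nvidia.com/gpu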
Testing
Now that we have things set up, let's test that we can schedule a pod with a GPU. We can use the vector add example from earlier.
# gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
$ kubectl apply -f gpu-pod.yaml
If we run kubectl get pods -w we should see our pod go from Pending to ContainerCreating to Running to Completed.
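If you'd rather block until the pod finishes instead of watching, kubectl wait can do that too (the jsonpath form needs a reasonably recent kubectl, 1.23 or newer).
$ kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/vectoradd --timeout=120s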
Then if we check the logs we should see the same output as before.
$ kubectl logs vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
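When you're done testing, clean-up is just the usual kind workflow.
$ kubectl delete pod vectoradd
$ kind delete cluster --name gpu-test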
What next?
I built a quick version of kind with GPU support so that I can quickly test the thing I'm actually meant to be working on. Ideally I would like to help add this feature into upstream kind for everyone to use, but I don't have the time right now. Maybe I'll find some time another day.
Sadly this fork of kind will slowly go out of date unless this feature is added upstream. These are the consequences of forking a project: you own the maintenance, and I do not intend to maintain this.
You’re welcome to use my forked version, but here be dragons.