Running a Local LLM in Kubernetes with vLLM & LINSTOR

Running large language model (LLM) workloads in-house rather than consuming them through managed API services is a growing trend. Managed API services are convenient, but token costs grow quickly at scale, latency can be unpredictable, and routing sensitive data through public endpoints raises compliance concerns. A hybrid approach (running open source models locally for high-volume or sensitive workloads while reserving managed API calls for tasks that genuinely need them) is a practical middle ground. This article documents what it took to set up that kind of self-hosted inference stack in a Kubernetes lab environment, with vLLM for inference and LINSTOR® for highly available persistent storage.

Background on vLLM

vLLM is a high-performance, open source inference engine for large language models. It is designed for serving many concurrent requests efficiently in a cluster environment.

An important characteristic of vLLM for this use case is that it exposes an OpenAI-compatible REST API. Anything that already talks to the OpenAI API (LangChain, LlamaIndex, or your own code that calls the OpenAI SDK) can be pointed at a self-hosted vLLM instance with nothing more than a URL change. That compatibility is what makes the hybrid architecture described earlier a viable alternative.
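To make the "URL change" point concrete, here is a minimal sketch that uses only the Python standard library. It builds (but does not send) the same OpenAI-style chat completions request for two different base URLs. The in-cluster address matches the Service created later in this article. The managed endpoint, its model name, and the API key are placeholders, and build_chat_request is a hypothetical helper rather than part of any SDK.

```python
import json
import urllib.request

# In-cluster address of the vLLM Service created later in this article.
LOCAL_BASE_URL = "http://vllm-server.default.svc.cluster.local:8000/v1"
# Placeholder for a managed, OpenAI-compatible endpoint (assumption).
MANAGED_BASE_URL = "https://api.example.com/v1"


def build_chat_request(base_url, model, prompt, api_key=None):
    """Build, without sending, an OpenAI-style chat completions request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Managed services usually expect a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Same helper, same payload shape; only the base URL (and credentials) differ.
local = build_chat_request(LOCAL_BASE_URL, "meta-llama/Llama-3.2-1B-Instruct", "Hello")
managed = build_chat_request(
    MANAGED_BASE_URL, "some-managed-model", "Hello", api_key="sk-placeholder"
)
print(local.full_url)
print(managed.full_url)
```

Passing either request object to urllib.request.urlopen would send it. The request body has the same shape in both cases, which is the property that makes switching between local and managed back ends cheap.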

Overview of the setup

Instructions in this article use a Kubernetes cluster with LINSTOR providing persistent storage through the LINSTOR CSI driver. LINSTOR is a software-defined storage solution built on DRBD® that provides replicated block storage across nodes. Replicated storage is a good fit for storing large model weight files that need to survive pod restarts and node failures.

The model used in this article is meta-llama/Llama-3.2-1B-Instruct, a small but capable model from Meta that has been fine-tuned to follow user instructions. At 1B parameters, it is lightweight enough to run on CPU (important for a lab without dedicated GPU nodes) while still being useful for testing the setup.

Prerequisites

Before deploying anything in Kubernetes, you need access to the model itself. Meta Llama models are gated on Hugging Face, meaning you need to request access before you can download them.

  1. Create an account at huggingface.co.
  2. Navigate to the Llama-3.2-1B-Instruct model page and submit an access request.
  3. After approval, go to your Hugging Face account settings and create an access token with read permissions.
  4. Keep that token nearby because you will need it in the next section.

 

You will also need LINSTOR deployed in Kubernetes, along with a StorageClass that Kubernetes workloads can use to request persistent volumes. In the next section, you will create a PersistentVolumeClaim (PVC) from a LINSTOR StorageClass named linstor-csi-lvm-thin-r2.

Deployment

The deployment consists of four Kubernetes resources: a PersistentVolumeClaim for model storage, a Secret for the Hugging Face token, and a Deployment and a Service to run and expose the inference server.

Creating the PVC and secret

The PVC uses the linstor-csi-lvm-thin-r2 storage class, which provisions a thin-provisioned LVM volume with two replicas across the cluster. This provides both redundancy and efficient use of disk space, which is important when model weights can easily consume tens of gigabytes.
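As a rough back-of-the-envelope check on volume sizing, weight size can be approximated as parameter count multiplied by bytes per parameter. The sketch below is illustrative only: the parameter counts are approximate assumptions rather than figures from this article, 2 bytes per parameter assumes 16-bit weights, and the 20% headroom is an arbitrary allowance for tokenizer files and file system overhead. Both function names are hypothetical.

```python
def weights_size_gb(num_parameters, bytes_per_parameter=2):
    """Approximate size of model weights in decimal gigabytes (GB)."""
    return num_parameters * bytes_per_parameter / 1e9


def fits_on_volume(size_gb, volume_gib, headroom=0.2):
    """Return True if size_gb fits on a volume of volume_gib GiB,
    keeping a fraction of the volume (headroom) free."""
    usable_gb = volume_gib * 2**30 / 1e9 * (1 - headroom)
    return size_gb <= usable_gb


# Approximate parameter counts, assumed for illustration.
examples = {
    "1B-class model": 1_230_000_000,
    "8B-class model": 8_000_000_000,
    "70B-class model": 70_000_000_000,
}

for name, params in examples.items():
    size = weights_size_gb(params)
    verdict = "fits" if fits_on_volume(size, volume_gib=50) else "does not fit"
    print(f"{name}: ~{size:.1f} GB of weights, {verdict} on a 50Gi volume")
```

Under these assumptions, a 1B-class model needs only a few gigabytes, so the 50Gi request used in this article leaves room to cache larger models later, while a 70B-class model would need a bigger claim.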

The container mounts the volume at /root/.cache/huggingface, which is where the vLLM container caches downloaded model weights. Because this path is backed by a persistent volume, the model is downloaded from Hugging Face only once. Later pod restarts skip the download entirely because the weights persist on the LINSTOR-provided storage.

Enter the following command to create the PVC and the Hugging Face token secret:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  storageClassName: linstor-csi-lvm-thin-r2
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "REPLACE_WITH_YOUR_TOKEN"
EOF

❗ IMPORTANT: Before applying this configuration, replace REPLACE_WITH_YOUR_TOKEN with your actual Hugging Face access token, and change the StorageClass name if yours is different.

Deploying vLLM

Enter the following command to deploy the vLLM inference server and its service, adapted from the vLLM Kubernetes deployment documentation:

VLLM_IMAGE=public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: vllm
        image: $VLLM_IMAGE
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.2-1B-Instruct --gpu-memory-utilization 0.80"
        ]
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
          - containerPort: 8000
        volumeMounts:
          - name: llama-storage
            mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF

📝 NOTE: Even on a CPU-only deployment, vLLM uses the --gpu-memory-utilization flag to govern how aggressively it reserves memory for processing requests. Without this flag, vLLM defaults to 92% of available memory, which caused startup failures in my testing environment. Setting it to 0.80 provided enough headroom in my environment for the engine to initialize successfully.

Watching the logs

The first startup takes a few minutes because vLLM needs to download the model weights from Hugging Face (roughly 2.5 GB for this model) and initialize the engine. Enter the following command to follow the logs:

kubectl logs -f deployment/vllm-server

After you see something such as INFO: Application startup complete, you are ready to test.
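If you would rather script the wait than watch logs, polling an HTTP endpoint is one option. The sketch below assumes that the vLLM OpenAI-compatible server answers GET /health with HTTP 200 when it is ready, and that the service is reachable from wherever the script runs. The function names, attempt count, and interval are illustrative choices, not values from this article.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=5):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(url, attempts=120, interval_s=5.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe(url) succeeds; return the number of attempts used.

    Raises TimeoutError if the endpoint is still not ready after `attempts` tries.
    The probe and sleep parameters are injectable to make the loop easy to test.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(interval_s)
    raise TimeoutError(f"{url} was not ready after {attempts} attempts")
```

For example, with kubectl port-forward service/vllm-server 8000:8000 running in another terminal, calling wait_until_ready("http://localhost:8000/health") would block until the engine has finished initializing or the attempts run out.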

Testing the deployment

The simplest way to test from inside the cluster is to deploy a throwaway curl pod. Enter the following command to create one:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl-client
  namespace: default
spec:
  containers:
  - name: curl
    image: curlimages/curl:latest
    command: ["sleep", "infinity"]
  restartPolicy: Never
EOF

Start an interactive shell environment in the pod:

kubectl exec -it curl-client -- sh

Then send a request to the vLLM service by using its in-cluster DNS name:

curl http://vllm-server.default.svc.cluster.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

If everything is working, you will get back a response in the familiar OpenAI format:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 9,
    "total_tokens": 24
  }
}
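Because the response follows the OpenAI chat completions shape, client code can pull out the reply and token usage with ordinary JSON handling. The following sketch parses the sample response shown above. The summarize_completion helper is hypothetical, and it assumes at least one entry in choices.

```python
import json

# The sample response from this article, as a JSON string.
SAMPLE_RESPONSE = """
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 9,
    "total_tokens": 24
  }
}
"""


def summarize_completion(response):
    """Extract the first reply and token usage from a chat completions response."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }


summary = summarize_completion(json.loads(SAMPLE_RESPONSE))
print(summary["text"])
print(f"finish_reason={summary['finish_reason']}, total_tokens={summary['total_tokens']}")
```

Tracking total_tokens per request is also a simple way to compare local usage against what the same traffic would have cost on a managed API.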

When you are done testing, delete the curl pod:

kubectl delete pod curl-client

Pausing the deployment

If you need to pause the deployment while troubleshooting without losing your PVC or configuration, scale it to zero:

kubectl scale deployment vllm-server --replicas=0

💡 TIP: Scaling the deployment to zero keeps the PVC and cached model weights intact. The model will not need to be re-downloaded when you scale the deployment back up.

Bring it back up with:

kubectl scale deployment vllm-server --replicas=1
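If you want to confirm that the cached weights survived, it helps to know where to look on the volume. The sketch below only computes the expected directory. It assumes the conventional Hugging Face hub cache layout (a hub/models--<org>--<name> directory under the cache root) and that the cache location has not been overridden through environment variables. The helper name is hypothetical.

```python
from pathlib import PurePosixPath


def expected_cache_dir(model_id, cache_root="/root/.cache/huggingface"):
    """Return the directory where the Hugging Face hub cache is expected to
    store a model, given the default cache layout."""
    folder = "models--" + model_id.replace("/", "--")
    return PurePosixPath(cache_root) / "hub" / folder


path = expected_cache_dir("meta-llama/Llama-3.2-1B-Instruct")
print(path)
```

Listing that path inside the running container, for example with kubectl exec deployment/vllm-server -- ls followed by the printed path, should show the cached model if the layout assumption holds in your image.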

Conclusion

This lab setup demonstrates the basic pattern: a self-hosted LLM running in Kubernetes, backed by replicated LINSTOR-managed persistent storage, and accessible through a standard OpenAI-compatible API. Caching the model weights on a LINSTOR volume means that pod restarts are fast, and the DRBD-backed replication means that the volume is not a single point of failure.

From here, you could try adding GPU nodes to the cluster for significantly better performance, or fine-tuning the model on your own private data to make it more capable for your specific use case. If you’re thinking about deploying hybrid AI, try layering in an inference router to dispatch requests between the local deployment and a managed API based on cost, latency, or capability requirements. The llm-d project, an open source Kubernetes-native project for distributed inference and request routing, looks like a good place to start exploring this kind of setup. Admittedly, most of these “next steps” are concepts I’ve yet to play with, and are only things I’ve read about while “doom-scrolling” in bed at night, so you are on your own there (for now!).

If you have questions about running vLLM or any containerized applications with LINSTOR in Kubernetes, or about other use cases for LINBIT® software in your environment, you can reach out to the LINBIT team or join the LINBIT Community Forum.

Matt Kereczman

Matt Kereczman is a Solutions Architect at LINBIT with a long history of Linux system administration and Linux system engineering. Matt is a cornerstone of LINBIT's technical team, and plays an important role in making LINBIT's and its customers' solutions great. Matt was President of the GNU/Linux Club at Northampton Area Community College prior to graduating with honors from Pennsylvania College of Technology with a BS in Information Security. Open source software and hardware are at the core of most of Matt's hobbies.
