Running large language model (LLM) workloads in-house rather than consuming them through managed API services is a growing trend. Managed API services are convenient, but token costs grow quickly at scale, latency can be unpredictable, and routing sensitive data through public endpoints raises compliance concerns. A hybrid approach (running open source models locally for high-volume or sensitive workloads while reserving managed API calls for tasks that genuinely need them) is a practical middle ground. This article documents what it took to set up that kind of self-hosted inference stack in a Kubernetes lab environment, with vLLM for inference and LINSTOR® for highly available persistent storage.
Background on vLLM
vLLM is a high-performance, open source inference engine for large language models. It is designed for serving many concurrent requests efficiently in a cluster environment.
An important characteristic of vLLM for this use case is that it exposes an OpenAI-compatible REST API. Anything that already talks to the OpenAI API (LangChain, LlamaIndex, or your own code that calls the OpenAI SDK) can be pointed at a self-hosted vLLM instance with nothing more than a URL change. That compatibility is what makes the hybrid architecture described earlier a viable alternative.
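As a minimal sketch of that URL swap, the following Python uses only the standard library rather than the OpenAI SDK. The helper names `build_chat_request` and `send`, the placeholder API key, and the managed endpoint URL are illustrative assumptions. The local address is the in-cluster Service name used later in this article, and the sketch assumes the vLLM server was started without an API key requirement, so the bearer token is ignored.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="not-needed"):
    """Build an OpenAI-style chat completion request for any compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the base URL differs between the two targets (both values are assumptions).
LOCAL_BASE_URL = "http://vllm-server.default.svc.cluster.local:8000/v1"
MANAGED_BASE_URL = "https://api.openai.com/v1"

local_request = build_chat_request(
    LOCAL_BASE_URL,
    "meta-llama/Llama-3.2-1B-Instruct",
    "What is the capital of France?",
)
```

Calling `send(local_request)` from a machine that can reach the service would return the decoded response. Pointing the same code at a managed API means passing `MANAGED_BASE_URL`, a model name that the provider offers, and a real API key.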
Overview of the setup
Instructions in this article use a Kubernetes cluster with LINSTOR providing persistent storage through the LINSTOR CSI driver. LINSTOR is a software-defined storage solution built on DRBD® that provides replicated block storage across nodes. Replicated storage is a good fit for storing large model weight files that need to survive pod restarts and node failures.
The model used in this article is meta-llama/Llama-3.2-1B-Instruct, a small but capable model from Meta that has been fine-tuned to follow user instructions. At 1B parameters, it is lightweight enough to run on CPU (important for a lab without dedicated GPU nodes) while still being useful for testing the setup.
Prerequisites
Before deploying anything in Kubernetes, you need access to the model itself. Meta Llama models are gated on Hugging Face, meaning you need to request access before you can download them.
- Create an account at huggingface.co.
- Navigate to the Llama-3.2-1B-Instruct model page and submit an access request.
- After approval, go to your Hugging Face account settings and create an access token with read permissions.
- Keep that token nearby because you will need it in the next section.
You will also need LINSTOR deployed into Kubernetes, along with a StorageClass that Kubernetes workloads can use to request PersistentVolumeClaims. In the next section you will create a PersistentVolumeClaim from a LINSTOR StorageClass named linstor-csi-lvm-thin-r2.
Deployment
The deployment consists of four Kubernetes resources: a PersistentVolumeClaim (PVC) for model storage, a Secret for the Hugging Face token, a Deployment to run the inference server, and a Service to expose it inside the cluster.
Creating the PVC and secret
The PVC uses the linstor-csi-lvm-thin-r2 storage class, which provisions a thin-provisioned LVM volume with two replicas across the cluster. This provides both redundancy and efficient use of disk space, which is important when model weights can easily consume tens of gigabytes.
The Deployment in the next section mounts this volume in the container at /root/.cache/huggingface, which is where the vLLM container caches downloaded model weights. By backing this path with a persistent volume, the model is downloaded from Hugging Face only once. Later pod restarts skip the download entirely because the weights persist on the LINSTOR-provided storage.
Enter the following command to create the PVC and the Hugging Face token secret:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  storageClassName: linstor-csi-lvm-thin-r2
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "REPLACE_WITH_YOUR_TOKEN"
EOF
❗ IMPORTANT: Replace REPLACE_WITH_YOUR_TOKEN with your actual Hugging Face access token, and change the StorageClass name if yours is different, before applying this configuration.
Deploying vLLM
Enter the following command to deploy the vLLM inference server and its service, adapted from the vLLM Kubernetes deployment documentation:
VLLM_IMAGE=public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
        - name: vllm
          image: $VLLM_IMAGE
          command: ["/bin/sh", "-c"]
          args: [
            "vllm serve meta-llama/Llama-3.2-1B-Instruct --gpu-memory-utilization 0.80"
          ]
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: llama-storage
              mountPath: /root/.cache/huggingface
      volumes:
        - name: llama-storage
          persistentVolumeClaim:
            claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
EOF
📝 NOTE: Even on a CPU-only deployment, vLLM uses the --gpu-memory-utilization flag to govern how aggressively it reserves memory for processing requests. Without this flag, vLLM defaults to 92% of available memory, which caused startup failures in my testing environment. Setting it to 0.80 provided enough headroom in my environment for the engine to initialize successfully.
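To make the headroom trade-off concrete, here is a purely arithmetic sketch. It assumes the fraction applies to the total memory visible to the container, and the 16 GiB figure is a hypothetical node size, not a measurement from the lab. The function name `memory_budget` is illustrative, and the sketch does not model how vLLM accounts for memory internally.

```python
def memory_budget(total_gib, utilization):
    """Return (reserved_gib, headroom_gib) for a given utilization fraction."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    reserved = total_gib * utilization
    return round(reserved, 2), round(total_gib - reserved, 2)


# Hypothetical node with 16 GiB of memory visible to the container.
TOTAL_GIB = 16

# Compare the default fraction described in the note with the value used here.
for fraction in (0.92, 0.80):
    reserved, headroom = memory_budget(TOTAL_GIB, fraction)
    print(f"{fraction:.2f}: reserve {reserved} GiB, leave {headroom} GiB free")
```

On that hypothetical node, lowering the fraction from 0.92 to 0.80 raises the memory left for everything else in the container from about 1.28 GiB to about 3.2 GiB.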
Watching the logs
The first startup takes a few minutes. vLLM needs to download the model weights from Hugging Face (roughly 2.5 GB for this model) and initialize the engine. Enter the following command to follow the logs:
kubectl logs -f deployment/vllm-server
After you see something such as INFO: Application startup complete, you are ready to test.
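If you would rather script the wait than watch the logs, a small polling loop is one option. The sketch below assumes that the vLLM OpenAI-compatible server answers GET requests on a /health path with HTTP 200 when it is ready, and that the script runs somewhere that can resolve the Service DNS name. The function names, attempt count, and delay are illustrative assumptions.

```python
import time
import urllib.request


def http_health_check(url, timeout=5):
    """Return True if a GET request to the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(check, attempts=60, delay=10, sleep=time.sleep):
    """Call check() until it returns True.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


# Assumed health URL, built from the Service name used in this article.
HEALTH_URL = "http://vllm-server.default.svc.cluster.local:8000/health"

# Example call (commented out because it needs network access to the cluster):
# wait_until_ready(lambda: http_health_check(HEALTH_URL))
```

Passing the check as a callable keeps the retry logic independent of HTTP, so the same loop could wrap a different readiness test.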
Testing the deployment
The simplest way to test from inside the cluster is to deploy a throwaway curl pod. Enter the following command to create one:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl-client
  namespace: default
spec:
  containers:
    - name: curl
      image: curlimages/curl:latest
      command: ["sleep", "infinity"]
  restartPolicy: Never
EOF
Start an interactive shell environment in the pod:
kubectl exec -it curl-client -- sh
Then send a request to the vLLM service by using its in-cluster DNS name:
curl http://vllm-server.default.svc.cluster.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
If everything is working, you will get back a response in the familiar OpenAI format, similar to the following (abridged to the most relevant fields):
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 9,
    "total_tokens": 24
  }
}
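Application code usually wants only a few fields from that structure. As a hedged sketch, the following Python pulls the assistant text, finish reason, and token count out of a response shaped like the sample above. The function name `summarize_completion` is illustrative, and the sample body is copied from this article rather than captured from a live server.

```python
import json


def summarize_completion(body):
    """Extract the assistant reply and token usage from a chat completion body."""
    data = json.loads(body)
    choice = data["choices"][0]
    usage = data.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Sample response body taken from the example output in this article.
SAMPLE = """
{
  "choices": [
    {
      "message": {"role": "assistant", "content": "The capital of France is Paris."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 15, "completion_tokens": 9, "total_tokens": 24}
}
"""

summary = summarize_completion(SAMPLE)
print(summary["text"])
```

Tracking `total_tokens` per request is also a simple way to estimate what the same traffic would cost on a managed API.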
When you are done testing, delete the curl pod:
kubectl delete pod curl-client
Pausing the deployment
If you need to pause the deployment while troubleshooting without losing your PVC or configuration, scale it to zero:
kubectl scale deployment vllm-server --replicas=0
💡 TIP: Scaling the deployment to zero keeps the PVC and cached model weights intact. The model will not need to be re-downloaded when you scale the deployment back up.
Bring it back up with:
kubectl scale deployment vllm-server --replicas=1
Conclusion
This lab setup demonstrates the basic pattern: a self-hosted LLM running in Kubernetes, backed by replicated LINSTOR-managed persistent storage, and accessible through a standard OpenAI-compatible API. Caching the model weights on a LINSTOR volume means that pod restarts are fast, and the DRBD-backed replication means that the volume is not a single point of failure.
From here, you could try adding GPU nodes to the cluster for significantly better performance, or fine-tuning the model on your own private data to make it more capable for your specific use case. If you’re thinking about deploying hybrid AI, try layering in an inference router to dispatch requests between the local deployment and a managed API based on cost, latency, or capability requirements. The llm-d project, an open source Kubernetes-native framework for distributed LLM inference with inference-aware request routing, is one place to start looking. Admittedly, most of these “next steps” are concepts I’ve yet to play with, and are only things I’ve read about while “doom-scrolling” in bed at night, so you are on your own there (for now!).
If you have questions about running vLLM or any containerized applications with LINSTOR in Kubernetes, or about other use cases for LINBIT® software in your environment, you can reach out to the LINBIT team or join the LINBIT Community Forum.