LINSTOR® is the LINBIT® SDS solution for managing Linux block storage. If you’ve used LINSTOR, you know how many knobs can be turned when configuring it. If you’ve followed along with one of our quickstart blogs or a README in one of LINBIT’s GitHub repositories, you’ve probably set up a LINSTOR cluster without much consideration for optimizing performance. Most of our blog posts and quickstarts are geared towards introducing the reader to a project or feature, as opposed to throwing the reader into the deep end. This post, however, will cover those topics and get us at least waist deep in the world of storage performance for Kubernetes with LINSTOR.
Standard Deployments and Their Expectations
Before jumping into what could be tuned, we should define what a “standard issue” LINSTOR deployment in Kubernetes could look like. One of the most straightforward ways to deploy LINSTOR into Kubernetes is by simply giving the LINSTOR operator the name of an empty block device (/dev/vdb in our example) and letting LINSTOR set it up as a LINSTOR storage pool for you. At helm install time this is done by defining your LINSTOR operator’s Helm values like this:
operator:
  satelliteSet:
    storagePools:
      lvmPools:
      - name: lvm-thick
        volumeGroup: drbdpool
        devicePaths:
        - /dev/vdb
If the above settings were in a file named linstor-op-vals.yaml, then you’d deploy LINSTOR into Kubernetes using Helm like so:
$ helm install -f ./linstor-op-vals.yaml linstor-op linstor/linstor
Those settings would result in an LVM volume group named drbdpool being created on a block device named /dev/vdb attached to your worker nodes, which would then be added to LINSTOR as a storage pool named lvm-thick. You could then define a LINSTOR StorageClass in Kubernetes that references this storage pool with a definition like this:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thick-r2"
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "2"
  storagePool: "lvm-thick"
reclaimPolicy: Retain
allowVolumeExpansion: true
With these configurations applied, your Kubernetes users will be able to request persistent volumes (PV) from the linstor-csi-lvm-thick-r2 StorageClass. When they do, each PV provisioned by LINSTOR from this StorageClass will result in an LVM logical volume of the requested size within the drbdpool volume group, which will be used as backing storage for a DRBD volume that is replicating to a single peer in the cluster with the same storage topology provisioned by LINSTOR.
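As a rough sketch of what such a request could look like, the PersistentVolumeClaim below asks for a volume from the linstor-csi-lvm-thick-r2 StorageClass. The claim name (demo-pvc) and the 1Gi size are hypothetical values chosen for illustration, not values from a real deployment.

```yaml
# Hypothetical claim: the name and size are placeholders for illustration.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  # References the StorageClass defined above.
  storageClassName: linstor-csi-lvm-thick-r2
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

Because that StorageClass does not set a volumeBindingMode, Kubernetes uses its default of Immediate, so the volume is provisioned as soon as the claim is created rather than when a pod first uses it.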
For many users this standard deployment could be satisfactory, and there’s something to be said about keeping things simple. Having a PV replicated between two peers in the cluster ensures availability and resilience, and since LINSTOR provisions block storage replicated by DRBD, the overhead is limited and performance should be decent out-of-the-box.
That said, there’s always room for improvement. The following section will cover what I’ve found to offer the best performance when using LINSTOR to provide hyperconverged storage in Kubernetes.
Best Practices for Performance Tuned Deployments
From the lowest layer (hardware) to the top (file system options) your choices will have an effect on performance.
Physical or Cloud Storage Selection
Not much you can tune here but I feel like I have to mention the underlying storage.
When you’re purchasing storage for physical deployments or selecting your storage options for a cloud deployment, you’ll never be able to read or write faster than the underlying physical medium you choose. You should have a good understanding of your application’s requirements in terms of IOPS and throughput and choose the appropriate storage option within your budget. No software setting will bend space-time and make your hardware work faster than it was designed to, so it’s important to know you’re building on a solid foundation.
Cloud storage tiers are easier to move between but the biggest leaps in storage performance usually involve moving to a more expensive cloud instance type. For that reason it’s important to understand what your upgrade path looks like, on both the monetary and operations side.
If you find yourself needing more than your current storage is capable of, it’s certainly not impossible to move volumes between nodes or tiers of storage in LINSTOR once you’ve outgrown things; LINSTOR makes it pretty easy to do so.
Choosing your Storage Pool Provider
Once you have your physical storage attached to your cluster nodes, you’ll need to add it to a LINSTOR storage pool. Which storage pool provider you choose will have an impact on features and performance. Some options only make sense for very specific sets of hardware (like Exos and OpenFlex), so we’ll only be looking at the hardware agnostic providers: LVM and ZFS.
LVM versus LVM Thin
LVM can be set up in LINSTOR as a thick or thin LVM pool, meaning the volumes LINSTOR creates will either be fully allocated at the requested size upon provisioning, or will grow as they are used, respectively.
Thick LVM will perform better than thin LVM under I/O sensitive workloads because of its pre-allocation of blocks. However, thick LVM performance suffers badly when there is a snapshot of the volume attached to it, so much so that LINSTOR does not support thick LVM snapshots. Thin LVM allocates blocks as they’re needed which involves additional I/O. That additional I/O adds up under an application that makes frequent small writes.
ZFS versus ZFS Thin
ZFS, or more technically zvols created from a ZFS pool (zpool), can be used to back LINSTOR volumes as well. Under the hood, “thick” versus “thin” provisioned zvols really only differ in whether the space requested is reserved for them or not. This means that you’re really choosing the ability to overprovision your host’s storage when you choose the thin ZFS provider for your storage pool in LINSTOR; performance isn’t a concern here. Furthermore, LINSTOR supports snapshots of volumes provisioned from both thin and thick provisioned ZFS backed storage pools.
ZFS support in Linux distributions is not as common as LVM yet. This is something to consider when designing your cluster, but that’s a topic for another blog.
Actual Numbers
Theories aside, I ran a quick test using FIO on some AWS instances with general purpose EBS volumes (gp3) to back each of the storage providers discussed above. EBS’ gp3 volumes deliver a baseline of 3000 IOPS. Each LINSTOR volume tested was replicating synchronously between the same three availability zones in the us-west-2 region. The FIO command and results are listed below:
echo 3 > /proc/sys/vm/drop_caches
fio --name fio-test --filename /dev/drbd$i --ioengine libaio --direct 1 \
  --rw randwrite --bs 4k --runtime 30s --numjobs 4 --iodepth=32 \
  --group_reporting --rwmixwrite=100
| | Thick ZFS | Thin ZFS | Thick LVM | Thin LVM |
|---|---|---|---|---|
| IOPS | 2098 | 1984 | 3093 | 1650 |
This was a single simple test to benchmark small writes to a single volume, but it does support our theory. Thick LVM performed the best in this test, much better than its thin counterpart, while thin and thick ZFS performed similarly to one another.
If you are only considering performance and are fine not having features like snapshots and snapshot shipping, you can select thick LVM for your storage pool provider, follow the most standard deployment steps, and call it a day. However, with a little tuning you can have your cake and eat it too.
Tuning Settings and Topologies for Storage Performance
There are plenty of knobs to turn on LINSTOR to maximize the performance of your Kubernetes storage while also supporting features like snapshots, cloning, and overprovisioning. The following sections will focus on different areas for tuning using the Thin LVM storage provider in LINSTOR since it was the lowest performer in our test.
Physical Storage Topology
DRBD keeps track of dirty blocks in its own metadata, which, by default, is stored at the end of the block device used for its backing storage. That means there are times when a single write to a DRBD® volume will cause multiple writes to the same underlying storage. Because your underlying storage has a finite amount of bandwidth, DRBD will be using some of what could otherwise be used by your application. Alternatively, it’s possible to configure LINSTOR so its DRBD volumes keep their metadata on a separate block device, or “external metadata” in DRBD terminology, in order to give your application dedicated access to your storage’s bandwidth.
To use external metadata in LINSTOR, you’ll need a separate block device attached to each of your hosts for DRBD’s metadata in addition to the block device being used by LINSTOR to provision persistent volumes. Then, when you’re deploying LINSTOR into your Kubernetes cluster, you’ll tell LINSTOR to set up LVM on this volume, and add it to LINSTOR as a storage pool. In the example deployment below, assume that /dev/nvme2n1 is a larger NVMe that will be our storage pool for provisioning persistent volumes, while /dev/nvme1n1 is a smaller NVMe that will be used as a storage pool for DRBD’s metadata. Following the deployment example at the top of this post, populate the linstor-op-vals.yaml file with the following options and deploy using Helm.
operator:
  satelliteSet:
    storagePools:
      lvmThinPools:
      - name: ext-meta-pool
        thinVolume: metapool
        volumeGroup: ""
        devicePaths:
        - /dev/nvme1n1
      - name: lvm-thin
        thinVolume: thinpool
        volumeGroup: ""
        devicePaths:
        - /dev/nvme2n1
Then, in your StorageClass definition for Kubernetes, set the StorageClass parameter, property.linstor.csi.linbit.com/StorPoolNameDrbdMeta, to the name of the external metadata pool, ext-meta-pool.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thin-r2"
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "2"
  storagePool: "lvm-thin"
  property.linstor.csi.linbit.com/StorPoolNameDrbdMeta: "ext-meta-pool"
reclaimPolicy: Retain
allowVolumeExpansion: true
When volumes are requested from this StorageClass, LINSTOR will create the data volume and the DRBD metadata volume in separate storage pools backed by separate physical devices, therefore dedicating your data volume’s performance to your application.
Collocate Persistent Volumes with Pods
By default, there is no guarantee that a pod will be scheduled on a worker node that has a physical replica of the persistent volume it’s using to store its data. That means the pod would be reaching over the network to perform I/O operations, which means additional latency. This “diskless attachment” (DRBD specific terminology) is sometimes desired, or even required, but for latency sensitive applications like databases, you’ll want to keep latencies as low as possible.
LINSTOR for Kubernetes is topology aware, so it’s only a matter of setting the correct options to enforce a “local access only” policy on a specific StorageClass. LINSTOR supports Stork as well, but using LINSTOR’s native HA Controller in combination with the CSI topology feature gate (which is enabled by default in recent Kubernetes versions) and some StorageClass parameters is the preferred method.
I’ll build on our deployment options from the section above by adding the stork, csi.enableTopology, and haController values below:
operator:
  satelliteSet:
    storagePools:
      lvmThinPools:
      - name: ext-meta-pool
        thinVolume: metapool
        volumeGroup: ""
        devicePaths:
        - /dev/nvme1n1
      - name: lvm-thin
        thinVolume: thinpool
        volumeGroup: ""
        devicePaths:
        - /dev/nvme2n1
stork:
  enabled: false
csi:
  enableTopology: true
haController:
  replicas: 3
Pairing those deployment options with the following StorageClass definition will tell LINSTOR to wait for a pod to be scheduled before provisioning the necessary persistent volume, and to provision one physical replica on the node the pod was scheduled on. I’ve added the volumeBindingMode: WaitForFirstConsumer option and the allowRemoteVolumeAccess: "false" parameter to the previous example:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thin-r2"
provisioner: linstor.csi.linbit.com
parameters:
  allowRemoteVolumeAccess: "false"
  autoPlace: "2"
  storagePool: "lvm-thin"
  property.linstor.csi.linbit.com/StorPoolNameDrbdMeta: "ext-meta-pool"
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
This will ensure the lowest latency access to the persistent volumes created from this StorageClass. Also, if something happens to the node running your application, LINSTOR’s HA Controller will reschedule the pod on the other node with the second replica of the volume, and will even do so more quickly than the default pod rescheduling mechanisms in Kubernetes.
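To make the effect of WaitForFirstConsumer concrete, here is a minimal sketch of a claim and a pod that consume the linstor-csi-lvm-thin-r2 StorageClass. The names (demo-data, demo-app), the 10Gi size, the busybox image, and the mount path are all hypothetical placeholders, not part of the deployment described in this post.

```yaml
# Hypothetical claim: stays Pending until a pod that uses it is scheduled.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  storageClassName: linstor-csi-lvm-thin-r2
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
# Hypothetical pod: scheduling it triggers provisioning of the volume.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-data
```

With this binding mode, the claim should remain in the Pending state until the pod is scheduled, at which point the volume is provisioned with the pod's node in mind.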
Tuning DRBD
Lastly, we can tune DRBD settings using parameters on our StorageClasses. All the typical DRBD tunings can be set in the StorageClass definitions, and there are many, but we’ll only focus on three.
parameters:
  [...]
  DrbdOptions/Disk/disk-flushes: "no"
  DrbdOptions/Disk/md-flushes: "no"
  DrbdOptions/Net/max-buffers: "10000"
If your physical storage is attached using battery backed write caches, or if you’re running in the cloud where we can assume this is true, we can disable some of the safety features in DRBD that aren’t needed. Also, configuring max-buffers to 10k will allow DRBD more buffer space which has a positive effect on resync times should anything interrupt the replication network or should a host reboot and require a background resync when it returns.
Nothing changes in the Helm options used to deploy the LINSTOR operator when tuning DRBD settings. Only the StorageClass definition from our previous example needs modification, specifically by adding the DrbdOptions above to the list of parameters.
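Put together, the tuned StorageClass might look like the sketch below. It merges the three DrbdOptions parameters into the previous StorageClass example and assumes nothing beyond the values already shown in this post. Whether disabling flushes is appropriate still depends on your storage having a protected write cache.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thin-r2"
provisioner: linstor.csi.linbit.com
parameters:
  allowRemoteVolumeAccess: "false"
  autoPlace: "2"
  storagePool: "lvm-thin"
  property.linstor.csi.linbit.com/StorPoolNameDrbdMeta: "ext-meta-pool"
  # Only disable flushes when the write cache is battery backed (or equivalent).
  DrbdOptions/Disk/disk-flushes: "no"
  DrbdOptions/Disk/md-flushes: "no"
  # Extra buffer space for DRBD, which helps resync times.
  DrbdOptions/Net/max-buffers: "10000"
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```

Keep in mind that Kubernetes does not allow the parameters of an existing StorageClass to be changed in place, so you would likely need to delete and recreate it, or create one under a new name, for the options to apply to newly provisioned volumes.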
Conclusion
I’ll wrap this blog post up by summarizing the topics covered throughout. Ensuring your physical storage is capable of satisfying your applications’ demands is the bedrock of your Kubernetes clusters’ storage performance. If you can, separate the data storage pool from your metadata storage pool using LINSTOR to maximize the write throughput available to your application. To make sure latency is kept to a minimum, deploy LINSTOR with CSI topology and HA Controller features enabled, while making sure to set your storage class options to wait for pod scheduling (volumeBindingMode: WaitForFirstConsumer) and disallow remote attachment (allowRemoteVolumeAccess: "false"). Turning off some of DRBD’s safety nets when it’s safe to do so, as well as giving DRBD some extra buffer space, can help with both write performance and resync speeds.
Following these guidelines, or at least knowing these knobs exist, should help you in your quest for achieving the best performance for your LINSTOR persistent storage in Kubernetes. For more information on anything above, see the LINSTOR and DRBD documentation, join our Slack community, or reach out to us directly to schedule a call! LINBIT is here to help 24/7/365.