Over the many years that I’ve worked for LINBIT, I cannot tell you how many times I’ve been asked, “How does a LINSTOR® (and/or DRBD®) cluster compare to a Ceph storage cluster?” Hopefully I can help answer that question in this blog post for anyone who might Google this in the future.
My short answer could simply be, “it doesn’t”. That’s not completely fair though, so my actual answer would be, “you shouldn’t use LINSTOR where it makes sense to use Ceph and you shouldn’t use Ceph where it makes sense to use LINSTOR”. I believe the comparison is attempted as often as it is because you could use Ceph and LINSTOR interchangeably in a lot of use cases and accomplish something close to what you sought out to achieve. The finer details are what need to be considered though, not the end result.
Differences in Block Devices and Distribution of Storage
LINSTOR partitions physical storage using LVM or ZFS, and layers DRBD on top; resulting in a replicated block device. Ceph’s Rados Block Device (RBD) creates storage objects distributed within RADOS which are then presented as a block device to clients. In both situations you end up with a block device tolerant to single node failures, apples and apples, right? Not quite.
LINSTOR provisions it’s DRBD block device resources on LVM/ZFS partitions in a RAID-1 like manner across cluster nodes. Each node assigned with a replica of a resource has a full copy of that resource’s data. It also stores DRBD’s metadata at the end of the backing block device, behind the space available to the filesystem or application using the resource. This architecture means you can remove all LINSTOR and DRBD components from all systems in the cluster and you’d still have access to your unaltered data via the backing LVM or ZFS partition. Read balancing across replicas is a feature of DRBD, but because of the RAID-1 like replication across nodes you’re typically going to have only 2-3 replicas for each resource, so scaling reads by adding replicas is expensive in terms of disk space. Write latency and throughput, on the other hand, are going to greatly benefit from the RAID-1 like replication since there’s no need to calculate how to stripe writes across the cluster’s nodes.
Ceph backs it’s block devices within RADOS objects, the object storage system that backs all of Ceph’s storage solutions (RBD, CephFS, and RADOSGW). Ceph’s RBD stripes and replicates data across the distributed object storage using the CRUSH (Controlled Replication Under Scalable Hashing) algorithm. This uses less physical storage than LINSTOR’s RAID-1 style replication, but it also spreads your data around the cluster in a pseudo-random fashion. The CRUSH algorithm’s distribution of data makes Ceph’s block devices highly scalable, but realistically impossible to reassemble without Ceph operating normally. Scaling the Ceph cluster does increase the read and write performance by adding more RADOS objects to read from and write to, but there is computational overhead in determining where those reads and writes are happening which will have a noticeable impact on latency and throughput.
Differences in Filesystem Access (or access in general)
LINSTOR’s DRBD devices can be formatted and mounted by any satellite in the LINSTOR cluster. Ceph’s object storage can be mounted using CephFS on any node able to authenticate to the Ceph cluster. In both cases you have a filesystem accessible over a network, oranges and oranges, right? Not really.
LINSTOR’s DRBD resources are (usually) readable and writable from only one node at a time, but DRBD can be promoted “on-demand” (eg. when mounted) as long as no other node is currently accessing that resource. LINSTOR’s drivers for various cloud platforms will temporarily promote more than one node to support live migrations of VMs, but this is done in a controlled manner making it safe without taking extra measures like shared filesystems and cluster-wide locking mechanisms. To some, this Primary/Secondary, or Active/Passive, approach is viewed as a limitation, but this is one of the reasons LINSTOR’s DRBD devices perform as well as the physical storage used to back them. That makes DRBD a good fit for database workloads, persistent message queues, and virtual machine root disks. This also enables the user to choose which filesystem they’d like to format their DRBD device with, which can make a difference in application performance when you know your IO patterns.
CephFS is Ceph’s POSIX-compliant filesystem that allows client systems to mount RADOS object storage just like any other Linux filesystem, and multiple clients can mount the same CephFS at the same time. Concurrent access to the same filesystem from many hosts is a must have for things like disk-image stores on hypervisors, web farms, and other larger-filesized data that tends to be read more frequently than it is written. If NFS works for your use case, CephFS can most likely work there as well. CephFS implements it’s own locking, and cannot magically avoid the overhead that comes with this requirement for concurrence. This makes CephFS less suitable for performance demanding applications that make frequent small writes, like databases or persisted messaging queues.
Both Ceph and LINSTOR will provide you with resilient storage, and both are fully open-source. Both have operators to make deployment simple in Kubernetes, and both have upstream OpenStack drivers. Both have snapshotting and disaster recovery capabilities. So, there may be more similarities than I first led on, but how these things work “under the hood” make them very different solutions for solving different problems.
Ceph’s use of the CRUSH algorithm will distribute storage more efficiently than LINSTOR’s RAID-1 like distribution. The downside to CRUSH is a complete loss of all data when Ceph isn’t operating normally. While a catastrophic failure of a LINSTOR cluster can be recovered by most Linux admins without extensive LINSTOR specific knowledge. If replication overhead is a bigger concern than the system’s complexity, consider Ceph. If the complexity of the system makes you feel uneasy about recovering from failure scenarios, consider LINSTOR.
If you require concurrent access to the same data that’s read from more frequently than it’s written to, or write latency isn’t the most important metric, consider Ceph. If you require frequent low latency writes, or flexibility in your filesystems, you should consider LINSTOR (component of LINBIT SDS).
Of course, if you need both Ceph and LINSTOR for different applications or workloads there’s nothing stopping you from using both!