IOPS World Record Broken – LINBIT Tops 14.8 million IOPS

In a performance test, LINBIT® measured 14.8 million IOPS on a 12-node cluster built from standard off-the-shelf Intel servers. This is the highest storage performance reached by a hyper-converged system on the market on this hardware basis. Even a small LINBIT storage system can provide millions of IOPS at latencies of a fraction of a millisecond. For real-world workloads, these figures translate into outstanding application performance.

Test setup

LINBIT chose this setup because our competitors have published test results from equivalent systems. This makes it easy to compare the strengths of each software offering under fair conditions in the same environment. We worked hard to get the most out of the system, and we succeeded! Microsoft managed to reach 13.7 million IOPS, and Storpool marginally topped that with 13.8 million IOPS. We reached 14.8 million remote read IOPS, a significant jump of 7.2%! “Those performance numbers mark a milestone in the development of our software. The results prove we speed up High Availability at a large scale”, says CEO Philipp Reisner. The numbers would scale up even further with a larger setup.

These exciting results are for 3-way synchronous replication using DRBD®. The test cluster was provided through the Intel®️ Data Center Builders program. It consists of 12 servers, each running 8 instances of the benchmark, making a total of 96 instances. The setup is hyper-converged, meaning that the same servers are used to run the benchmark and to provide the underlying storage.

For some benchmarks, one of the storage replicas is a local replica on the same node as the benchmark workload itself. This is a particularly effective configuration for DRBD.

DRBD provides a standard Linux block device, so it can be used directly from the host, from a container, or from a virtual machine. For these benchmarks, the workload runs in a container, demonstrating the suitability of LINBIT’s SDS solution, which consists of DRBD and LINSTOR®, for use with Kubernetes.

IOPS and bandwidth results are the totals from all 96 workload instances. Latency results are averaged.

Let’s look into the details!

Top performance with DRBD

5.0 million synchronously replicated write IOPS

This was achieved with a 4K random write benchmark with an IO depth of 64 for each workload. The setup uses Intel® Optane™ DC Persistent Memory to store the DRBD metadata. The writes are 3-way replicated with one local replica and two remote replicas. This means that the backing storage devices are writing at a total rate of 15 million IOPS.
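The replication arithmetic above can be checked with a short Python sketch. It uses only the figures quoted in this section. The split into local and remote writes is a simple model that assumes each client write is stored once per replica and that only non-local replicas generate network traffic.

```python
# Figures quoted in the text above.
CLIENT_WRITE_IOPS = 5_000_000   # replicated 4K random write IOPS seen by the workloads
REPLICAS = 3                    # 3-way replication
LOCAL_REPLICAS = 1              # one replica lives on the same node as the workload


def backing_write_iops(client_iops: int, replicas: int) -> int:
    """Each client write is persisted once on every replica."""
    return client_iops * replicas


def remote_write_iops(client_iops: int, replicas: int, local_replicas: int) -> int:
    """Only writes to non-local replicas have to cross the network (simplified model)."""
    return client_iops * (replicas - local_replicas)


total_backing = backing_write_iops(CLIENT_WRITE_IOPS, REPLICAS)
total_remote = remote_write_iops(CLIENT_WRITE_IOPS, REPLICAS, LOCAL_REPLICAS)

print(f"backing device writes:        {total_backing:,} IOPS")
print(f"writes sent over the network: {total_remote:,} IOPS")
```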

85μs synchronously replicated write latency

This was achieved with a 4K random write benchmark with serial IO, that is, an IO depth of 1. This means that the writes were persisted to all 3 replicas within an average time of only 85μs. DRBD attained this level of performance both when one of the replicas was local and when all were remote. The setup also uses Intel® Optane™ DC Persistent Memory to store metadata.

14.8 million remote read IOPS

This was achieved with a 4K random read benchmark with an IO depth of 64. This corresponds to 80% of the total theoretical network bandwidth of 75GB/s. This result was reproduced without any usage of persistent memory so that the value can be compared with those from our competitors.
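The bandwidth figure can be reproduced from the hardware listed under "Test infrastructure" (12 servers, each with dual 25GbE ports). The following Python sketch assumes that "4K" means 4096 bytes and ignores protocol overhead, so it is a rough check rather than an exact accounting.

```python
# Hardware and result figures taken from this article.
SERVERS = 12
PORTS_PER_SERVER = 2
PORT_GBIT_PER_S = 25            # 25GbE
READ_IOPS = 14_800_000
BLOCK_BYTES = 4096              # assumption: "4K" means 4 KiB

# Theoretical network bandwidth of the whole cluster, in bytes per second.
network_bytes_per_s = SERVERS * PORTS_PER_SERVER * PORT_GBIT_PER_S * 1e9 / 8

# Payload moved by the remote read benchmark, in bytes per second.
read_bytes_per_s = READ_IOPS * BLOCK_BYTES

utilisation = read_bytes_per_s / network_bytes_per_s

print(f"theoretical network bandwidth: {network_bytes_per_s / 1e9:.0f} GB/s")
print(f"remote read payload:           {read_bytes_per_s / 1e9:.1f} GB/s")
print(f"utilisation:                   {utilisation:.0%}")
```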

10.6 million IOPS with 70/30 mixed read/write

Representing a more typical real-world scenario, this benchmark consists of 70% reads and 30% writes and used an IO depth of 64. One of the 3 replicas was local.

Benefits of persistent memory

DRBD is optimized for persistent memory. When the DRBD metadata is stored on an NVDIMM, write performance is improved.

When the metadata is stored on the backing storage SSD with the data, DRBD can process 4.5 million write IOPS. This increases to 5.0 million when the metadata is stored on Intel® Optane™ DC Persistent Memory instead, an improvement of roughly 10%.

Moving the metadata onto persistent memory has a particularly pronounced effect on the write latency. This metric plummets from 113μs to 85μs with this configuration change. That is, the average write is 25% faster.
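As a quick check of these comparisons, the relative changes can be computed from the published figures. The inputs are the rounded values from the text, so the results may differ slightly from the percentages quoted above.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Rounded figures from the text: metadata on SSD vs. on persistent memory.
iops_gain = percent_change(4.5e6, 5.0e6)      # write IOPS
latency_change = percent_change(113, 85)      # write latency in microseconds

print(f"write IOPS change:    {iops_gain:+.1f}%")
print(f"write latency change: {latency_change:+.1f}%")
```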

Detailed results

Below are the full results for DRBD running on the 12 servers with a total of 96 benchmark workloads.

Benchmark name                              Without local replica   With local replica
Random read (higher is better)              14,800,000 IOPS         22,100,000 IOPS
Random read/write 70/30 (higher is better)  8,610,000 IOPS          10,600,000 IOPS
Random write (higher is better)             4,370,000 IOPS          5,000,000 IOPS
Sequential read (higher is better)          64,300 MB/s             111,000 MB/s
Sequential write (higher is better)         20,700 MB/s             23,200 MB/s
Read latency (lower is better)              129 μs                  82 μs
Write latency (lower is better)             85 μs                   84 μs

The IOPS and MB/s values have been rounded down to 3 significant figures.

All volumes are 500GiB in size, giving a total active set of 48,000GiB and consuming a total of 144,000GiB of the underlying storage. The workloads are generated using the fio tool with the following parameters:

Benchmark type  Block size  IO depth  Workload instances  Total active IOs
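The capacity totals and the number of IOs in flight follow from figures stated elsewhere in this article. The Python sketch below assumes that "Total active IOs" means the IO depth multiplied by the number of workload instances. That reading is an assumption, not something the article states.

```python
# Figures stated in the article.
NODES = 12
INSTANCES_PER_NODE = 8
VOLUME_GIB = 500
REPLICAS = 3

instances = NODES * INSTANCES_PER_NODE      # benchmark workloads, one volume each
active_set_gib = instances * VOLUME_GIB     # data addressed by the workloads
raw_gib = active_set_gib * REPLICAS         # space consumed on the backing storage


def total_active_ios(io_depth: int, workload_instances: int) -> int:
    """IOs in flight across the cluster (assumed: IO depth x instances)."""
    return io_depth * workload_instances


print(f"workload instances:       {instances}")
print(f"active set:               {active_set_gib:,} GiB")
print(f"underlying storage used:  {raw_gib:,} GiB")
print(f"active IOs at depth 64:   {total_active_ios(64, instances):,}")
print(f"active IOs at depth 1:    {total_active_ios(1, instances):,}")
```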

Quality controls

In order to ensure that the results are reliable, the following controls were applied:

  • The entire dataset was written after allocating the volumes, but before running the tests. This prevents artificially fast reads of unallocated blocks. When the backing device driver or firmware recognizes that an unallocated block is being read, it may simply return zeros without reading from the physical medium.
  • The benchmark uses direct IO to bypass the operating system cache and the working set was too large to be cached in memory in any case.
  • The tests were each run for 10 minutes. The metrics stabilized within a small proportion of this time.
  • The measurements were provided by the benchmarking tool itself, rather than being taken from a lower level such as the DRBD statistics. This ensures that the performance corresponds to that which a real application would experience.
  • The random access pattern was generated from a random seed, so that successive test runs did not favor the same blocks.
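Several of these controls map onto standard fio options. The Python sketch below builds a fio command line that applies them. It is not the command LINBIT used: the exact job parameters are not given in this article, and the job names, the `libaio` IO engine, and the device path `/dev/drbd1000` are assumptions chosen for illustration.

```python
import shlex
from typing import List, Optional


def fio_command(name: str, rw: str, device: str,
                iodepth: int = 64, rwmixread: Optional[int] = None) -> List[str]:
    """Build a fio argument list reflecting the quality controls listed above."""
    args = [
        "fio",
        f"--name={name}",
        f"--filename={device}",   # hypothetical DRBD device path
        f"--rw={rw}",             # randread, randwrite, randrw, ...
        "--bs=4k",                # 4K blocks, as in the random IO benchmarks
        f"--iodepth={iodepth}",
        "--ioengine=libaio",      # assumption: asynchronous IO engine
        "--direct=1",             # direct IO, bypassing the OS page cache
        "--time_based",
        "--runtime=600",          # 10 minutes per test
        "--randrepeat=0",         # do not reuse the same random sequence on each run
    ]
    if rwmixread is not None:
        args.append(f"--rwmixread={rwmixread}")   # read percentage for mixed workloads
    return args


read_cmd = fio_command("randread", "randread", "/dev/drbd1000")
mixed_cmd = fio_command("mixed-70-30", "randrw", "/dev/drbd1000", rwmixread=70)
latency_cmd = fio_command("write-latency", "randwrite", "/dev/drbd1000", iodepth=1)

for cmd in (read_cmd, mixed_cmd, latency_cmd):
    print(shlex.join(cmd))
```

The script only prints the command lines. Running them against a real device would overwrite its contents in the write tests, so the device path should be checked first.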

Software stack

The following key software components were used for these benchmarks:

  • Distribution: CentOS 8.0.1905
  • Kernel: Linux 4.18.0-80.11.2.el8_0.x86_64
  • LVM from distribution kernel
  • DRBD 9.0.21-1
  • Docker 19.03.5
  • Fio 3.7


In this text, and at LINBIT in general, we use the expression “2 replicas” to indicate that the data is stored on 2 storage devices. For these tests, there are 3 replicas, meaning that the data is stored on 3 storage devices.

In other contexts, the expression “2 replicas” might mean one original plus 2 replicas. In that case, the data would be stored on 3 storage devices.

Test infrastructure

These results were obtained on a cluster of 12 servers made available as part of the Intel® Data Center Builders program. Each server was equipped with the following configuration:

  • Processor: 2x Intel® Xeon Platinum 8280L CPU
  • Memory: 384GiB DDR4 DRAM
  • Persistent memory: 4x 512GB Intel® Optane™ DC Persistent Memory
  • Storage: At least 4x Intel® SSD DC P4510 of at least 4TB
  • Network: Intel® Ethernet Network Adapter XXV710 with dual 25GbE ports

The servers were all connected in a simple star topology with a 25Gb switch.

Speed is of the essence

Storage has often been a bottleneck in modern IT environments. The two requirements, speed and high availability, have always been in competition. If you aim for maximum speed, the quality of the high availability tends to suffer, and vice versa. With this performance test, we demonstrate a best-of-breed open source software-defined storage solution. A replicated storage system that combines high availability with the performance of local NVMe drives is now possible.

This technology enables any public and private cloud builder to deliver high performance for their applications, VMs and containers. If you aim to build a powerful private or public cloud, our solution meets your storage performance needs.

If you want to learn more or have any questions, do contact us.