The release of DRBD® 9.3.1 brings with it a “streaming I/O” feature. With this feature, DRBD can allocate compound pages from the kernel as I/O buffers. When applications make large I/O requests or streaming accesses, using compound pages can improve performance because DRBD uses CPU resources more efficiently.
Background on DRBD memory allocation
Historically, LINBIT® developers have optimized DRBD for replicating data for high availability use cases, with a focus on workloads that have small, frequent I/O patterns. While DRBD is not limited to replicating data writes with these I/O patterns, making applications such as databases and messaging queues highly available is a common use case for DRBD, because its performance excels in this area. There are, of course, use cases that make large I/O requests, such as video streaming and transcoding, large file copies, and others.
Linux supports I/O requests from 512 bytes to 1 MiB, in 512-byte increments. Before DRBD 9.3.1, DRBD handled its memory allocation by requesting one 4 KiB page at a time. This could mean up to 256 allocations for a single 1 MiB write.
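The arithmetic behind that worst case is straightforward. A minimal sketch, assuming the standard Linux 4 KiB base page size and 1 MiB maximum request size mentioned above:

```python
PAGE_SIZE = 4096       # order-0 page, 4 KiB
MAX_REQUEST = 1 << 20  # largest Linux I/O request, 1 MiB

def allocations_needed(request_bytes, page_bytes=PAGE_SIZE):
    """Number of single-page (order-0) allocations needed to buffer one request."""
    # Round up, so a request smaller than a page still costs one allocation.
    return -(-request_bytes // page_bytes)

print(allocations_needed(MAX_REQUEST))  # 256 allocations for a 1 MiB write
print(allocations_needed(512))          # 1 allocation for a 512-byte write
```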
The state of storage and networks today is not what it was more than 25 years ago when LINBIT CEO Philipp Reisner created DRBD as a master's thesis project in 1999 while attending university. Networking speeds are reaching 400 Gbps, and 800 Gbps is being standardized in IEEE 802.3df. Storage device technology, for example, SSD and NVMe, is getting faster with each generation. Technical advances in these hardware categories highlight the importance of DRBD becoming more CPU-efficient and not becoming a bottleneck for highly available applications.
Background on compound pages in the Linux kernel
For technical background about compound pages, you can read an interesting article about the pagemap process. As a brief summary, a compound page is a group of 2^n physically contiguous base pages. The Linux kernel uses compound pages similarly to the large pages supported by hardware: grouping smaller contiguous pages into a single huge page makes page table operations more CPU-efficient.
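In kernel terms, an order-n allocation returns 2^n contiguous order-0 pages. As a rough illustration (assuming the common 4 KiB base page size), the smallest order whose compound page can hold a given buffer:

```python
import math

PAGE_SIZE = 4096  # assumed 4 KiB base page

def order_for(size_bytes, page_bytes=PAGE_SIZE):
    """Smallest allocation order whose 2**order pages cover size_bytes."""
    pages = -(-size_bytes // page_bytes)  # ceiling division
    return max(0, math.ceil(math.log2(pages)))

print(order_for(4096))       # order 0: a single base page
print(order_for(64 * 1024))  # order 4: 16 contiguous pages
print(order_for(1 << 20))    # order 8: 256 contiguous pages, one 1 MiB buffer
```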
With the streaming I/O feature introduced in DRBD 9.3.1, DRBD has a new memory allocation strategy. DRBD will now attempt a single higher-order allocation first (try to use compound pages), and fall back to allocating order-0 pages (the traditional DRBD memory allocation strategy) on failure.
This DRBD memory allocation strategy will reduce DRBD CPU use, particularly on secondary nodes. Secondary nodes need to do fresh memory allocations with every incoming write replicated from the primary node. The primary node does not need to allocate a receive memory buffer for its own writes. The primary node only needs to read from a buffer already created by an application doing the data writes.
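The allocation strategy described above can be sketched in userspace Python. This is only an illustration of the try-higher-order-then-fall-back logic, not DRBD's actual code (which calls the kernel's page allocator); `try_alloc` is a hypothetical stand-in allocator that may fail for large orders:

```python
def alloc_buffer(size_bytes, try_alloc, page_bytes=4096):
    """Sketch of DRBD 9.3.1's strategy: attempt one higher-order allocation
    first, falling back to per-page (order-0) allocations on failure."""
    pages = -(-size_bytes // page_bytes)
    order = max(0, (pages - 1).bit_length())  # smallest order covering `pages`
    buf = try_alloc(order)                    # single compound-page attempt
    if buf is not None:
        return [buf]                          # one contiguous buffer
    # Fallback: the pre-9.3.1 path, one order-0 page at a time.
    return [try_alloc(0) for _ in range(pages)]

# A toy allocator that can only satisfy order-0 requests (fragmented memory):
fragmented = lambda order: object() if order == 0 else None
print(len(alloc_buffer(1 << 20, fragmented)))  # 256 fallback allocations

# A toy allocator that always succeeds:
roomy = lambda order: object()
print(len(alloc_buffer(1 << 20, roomy)))       # 1 compound allocation
```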
📝 NOTE: For pre-5.1 kernels, DRBD will always allocate order-0 pages, to prevent a buffer overflow risk because those kernels lack a bio_for_each_bvec kernel helper for handling multi-page allocations.
Early performance benchmark test results
DRBD 9.3.1 is already used on LINBIT internal production systems. To gauge the impact of the DRBD streaming I/O feature, LINBIT Solutions Architect Matt Kereczman ran the following before-and-after benchmark test on a system with an Intel® Xeon® Silver 4112 CPU @ 2.60GHz. The backing storage device Kereczman used to run the fio tests was a Samsung M.2 NVMe PCIe SM963, model number MZQKW480HMHQ-00003.
Fio command used as a benchmark test on both systems:
fio --name=test0 \
--readwrite=write \
--bs=1m \
--direct=1 \
--numjobs=1 \
--filename=/dev/drbd10
Before running both tests, output from a drbdadm status command run on node0 (primary role) showed node1 in a secondary role:
r0 role:Primary
disk:UpToDate open:no
node1 role:Secondary
peer-disk:UpToDate
Summary of parameters:
| Parameter | DRBD 9.3.0-rc.1 | DRBD 9.3.1 |
|---|---|---|
| DRBD Version | 9.3.0-rc.1 | 9.3.1 |
| Transport | TCP | TCP |
| Read/Write Mode | Write (Sequential, 1MB block) | Write (Sequential, 1MB block) |
| I/O Engine | psync | psync |
| I/O Depth | 1 | 1 |
| Total Data Transferred | 40.0 GiB | 40.0 GiB |
Summary of fio benchmark test results:
| Metric | DRBD 9.3.0-rc.1 | DRBD 9.3.1 | Delta |
|---|---|---|---|
| Throughput (MiB/s) | 816 MiB/s | 882 MiB/s | +8.09% |
| Throughput (MB/s) | 855 MB/s | 925 MB/s | +8.19% |
| IOPS (avg) | 816 | 882 | +8.09% |
| Run time | 50,210 ms | 46,434 ms | −7.52% |
| clat avg (usec) | 1,179.33 | 1,089.33 | −7.63% |
| clat min (usec) | 803 | 690 | −14.07% |
| clat max (usec) | 21,899 | 9,692 | −55.74% |
| clat stdev (usec) | 412.20 | 471.61 | +14.41% |
| lat avg (usec) | 1,223.73 | 1,131.71 | −7.52% |
| clat p50 (usec) | 1,123 | 979 | −12.82% |
| clat p90 (usec) | 1,319 | 1,270 | −3.71% |
| clat p99 (usec) | 3,032 | 3,392 | +11.87% |
| clat p99.9 (usec) | 5,473 | 5,604 | +2.39% |
| clat p99.99 (usec) | 12,649 | 7,701 | −39.12% |
| BW avg (KiB/s) | 836,075 | 903,991 | +8.12% |
| BW stdev (KiB/s) | 56,337 | 9,969 | −82.30% |
| IOPS stdev | 55.02 | 9.74 | −82.30% |
| CPU usr | 3.77% | 3.71% | −0.06pp |
| CPU sys | 7.30% | 7.55% | +0.25pp |
| drbd10 util | 92.01% | 91.90% | −0.11pp |
| nvme0n1 util | 42.86% | 41.41% | −1.45pp |
These benchmark test results show the positive impact that the DRBD streaming I/O feature had, including:
- About an 8% higher average throughput and correspondingly shorter test run time
- Significantly lower and more consistent latency at typical percentiles (p50 improved by about 13%)
- Significantly more stable throughput, as indicated by the bandwidth standard deviation dropping by 82%, suggesting that DRBD 9.3.1 with the streaming I/O feature sustained its speed much more evenly throughout the test
- Much better worst-case latency, where the absolute maximum completion latency was reduced by more than half, from 21.9 ms to 9.7 ms
The one minor caveat is a slight uptick in the p99 latency tail (3,392 µs compared with 3,032 µs).
📝 NOTE: Performance benchmark results will vary depending on the hardware in systems. The benefits of the DRBD streaming I/O feature might be much higher or lower depending on the ratio of CPU speed compared to network and backing storage device performance.
Independent benchmarking results
A SIOS (LINBIT Japanese partner) engineer has also shared some exciting early results with the LINBIT team. After running benchmarking tests comparing DRBD 9.2.7 and 9.3.1, using a RAM disk and fio, the partner confirmed an increase from about 1.4 GB/s to 2.0 GB/s and “a drastic reduction in latency.”
Conclusion
The DRBD streaming I/O optimization improves how DRBD allocates memory for receiving a write request. Before, DRBD allocated 4 KiB (4096-byte) pages, one at a time, until it had enough to buffer the write request. That means that for a 1 MiB write request, DRBD made 256 4 KiB allocations. With the new code in DRBD 9.3.1, DRBD now tries to allocate 1 MiB in a single kernel call. That takes less time and consumes fewer CPU cycles. The effect becomes more significant if the CPU is slow, or the network and backing block device throughput is very high.
As mentioned earlier, the LINBIT team now uses DRBD 9.3.1 exclusively on its internal production systems. If you are running an earlier DRBD version, the LINBIT team invites you to upgrade to realize the potential performance benefits in your deployments. Let us know your benchmarking results if you do any before-and-after testing, or share them with the community of LINBIT software users in the LINBIT Community Forum.