Independent Performance Testing of DRBD by E4

This is a guest post by Davide Obbi, Storage Specialist at E4.

In 2024, the Italy-based company E4 Computer Engineering evaluated DRBD® for use in its high performance computing (HPC) ecosystem. Since 2002, E4 has specialized in building HPC systems for academic and institutional research environments, such as CERN. More recently, the company has been applying its expertise and experience to the emerging HPC markets in artificial intelligence (AI), aerospace and defense, and data science. E4 prides itself on providing “mature and reliable solutions in sophisticated technological environments”, which could be why researchers at E4 chose to evaluate DRBD, the Linux kernel module that has been helping people create high availability (HA) storage clusters for more than two decades.

Similar to their approach to building sophisticated HPC systems, E4 took a sophisticated and detail-oriented approach to their performance testing of DRBD. In their conclusion, DRBD was easy to implement, and configuration changes, such as applying a CPU mask, were easy to make to gain some performance benefits. Furthermore, their results showed that DRBD had very little impact on CPU use, which bodes well for using DRBD in a hyperconverged HPC environment.

Davide Obbi, Storage Specialist at E4, conducted the performance testing in the E4 lab environment. Testing procedures, observations, and results follow, in Davide’s own words, with only some minor copy editing from LINBIT® to improve clarity.

LINBIT DRBD Performance Evaluation on NVMe Drives

Scope

Evaluate the performance of the LINBIT DRBD HA replication mechanism on the hardware nodes available in the lab.

Setup

Server Model: SYS-2029BT-HNTR

FW and BIOS:

  • 01.74.13 and 4.0
  • Virtualization feature disabled
  • Intel Power Technology (C-states) disabled
  • Hyperthreading enabled

CPU: 2x Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 16 cores each (HT enabled)

Drives: 3x NVMe per node, including the KIOXIA KCD61VUL3T20 and Scaleflux CSDU5SPC38 models tested below

OS and Kernel: RHEL 8.9 and 4.18.0-513.11.1.el8_9.x86_64

LINBIT DRBD: 9.2.7

Adapter: MT27800 Family [ConnectX-5], 2x 100Gb/s Ethernet ports, direct attach

MOFED and FW: 16.35.3502 and 23.10-1.1.9.0

Fio version: 3.19

Data Collection

Performance test results were recorded as absolute values from the Fio output, and over time by using Fio log plotting together with the Prometheus Node Exporter.
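For the over-time graphs, Fio can write per-interval bandwidth, IOPS, and latency logs that can then be plotted. A minimal sketch of how such logs might be produced (the job file name and log prefix are placeholders, not E4's exact invocation):

fio kcd6-tests.fio --write_bw_log=kcd6 --write_iops_log=kcd6 --write_lat_log=kcd6 --log_avg_msec=1000 --output=kcd6-results.json --output-format=json
fio_generate_plots "KCD6 tests"   # helper script distributed with Fio; builds gnuplot graphs from the log files in the current directory

Node-level CPU, network, and disk metrics come from the Prometheus Node Exporter scraped during the runs.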

Test Preparation

Initialize the drives on each node:

blkdiscard /dev/nvme0n1
blkdiscard /dev/nvme1n1
blkdiscard /dev/nvme2n1

Precondition the drives on each node:

[global]
description=Precondition all NVME
ioengine=libaio
direct=1
readwrite=write
bs=1M
numjobs=1
iodepth=32
group_reporting=0
size=100%

[nvme0n1_precondition]
filename=/dev/nvme0n1
[nvme1n1_precondition]
filename=/dev/nvme1n1
[nvme2n1_precondition]
filename=/dev/nvme2n1
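Assuming the job file above is saved as precondition.fio (the file name is an assumption), it would be run once per node so that each drive is sequentially written end to end before testing:

fio precondition.fio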

Testing with KCD6 drives

Fio Job File

[global]
description=KCD61VUL3T20
ioengine=libaio
time_based=1
startdelay=60
ramp_time=60
runtime=300
direct=1
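The per-workload job sections are not reproduced above. Based on the workloads and results reported below, each run would have added a section roughly along these lines (the job name, numjobs, and iodepth shown here are assumptions, not E4's exact values):

[nvme0n1_randwrite_4k]
filename=/dev/nvme0n1
readwrite=randwrite
bs=4k
numjobs=32
iodepth=16
group_reporting=1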

NVMe Drive Test Results

Absolute results:

  • sequential reads, 1M bs: ~3.2 GiB/s
  • random reads, 4K: ~780K IOPS
  • sequential writes, 1M bs: ~2.3 GiB/s
  • random writes, 4K: ~300K IOPS (170K per vendor spec)

The over-time sequential and random read results are displayed in sequence below:

All NVMe drives experienced a performance drop during execution of the random writes test, caused by the write amplification factor (WAF) of the drives. It is difficult to see with the naked eye, but there are 32 colored lines in the graph below, each representing an I/O job, for a total of around 300K IOPS per device after the drop point. Because KIOXIA advertises the drive as capable of 170K IOPS with random write workloads, we expect the value to drop further with continued utilization:

Interestingly, the same happens with sequential writes, where the aggregated bandwidth is reduced by half; when the tests are repeated, the same behavior is experienced:

DRBD config

The tests were run on the /dev/drbd* devices present on the “primary“ node, after the drbdadm status command reported all the resources as up-to-date. Configuration for the DRBD resource:

# cat /etc/drbd.d/r0.res
resource "r0" {
  volume 0 {
    device minor 1;
    disk "/dev/nvme0n1";
    meta-disk internal;
  }
  volume 1 {

📝 NOTE: It was not possible to use RDMA with MOFED because we were unable to compile DRBD successfully.

📝 NOTE: At this point, no DRBD tuning had been done.
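Before each run, resource health can be verified with drbdadm status. For a resource like r0, healthy output looks roughly like the following sketch (exact fields vary with the DRBD version and the number of volumes defined):

# drbdadm status r0
r0 role:Primary
  volume:0 disk:UpToDate
  volume:1 disk:UpToDate
  idnode12.e4red role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate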

DRBD test results

Absolute results:

                                    NVMe device               DRBD device
 sequential reads, 1M bs (GiB/s)    ~3.2                      ~3.2 (~5.1 with read balancing)
 random reads, 4K (IOPS)            ~780K                     ~410K
 sequential writes, 1M bs (GiB/s)   ~2.3                      ~2.3
 random writes, 4K (IOPS)           300K (170K by vendor)     170K

The over-time sequential and random read results are displayed in sequence below:

The sequential read workloads performed as if they were run directly on the NVMe devices, while the random reads experienced an almost 44% performance drop. Because the drop is considerable, the random read tests were run again on the bare NVMe drive and confirmed results similar to before: ~780K IOPS.

During the execution of the sequential write workload, it was noticed that Fio had some sort of “pausing behavior“, which is reflected in the over time graph for the three DRBD devices:

and inconsistent job execution for the three DRBD devices:

It was also noticeable that while the writes were occurring, the DRBD resource never reached full synchronization, that is, an “UpToDate“ status. Observability helped identify multiple TCP retransmissions and suboptimal TCP memory settings:

Kernel Tuning

Because of the TCP retransmissions experienced during the sequential write test, the following changes were applied to the kernel:

sysctl -w net.ipv4.tcp_slow_start_after_idle=0
sysctl -w net.core.rmem_max=56623104
sysctl -w net.core.wmem_max=56623104
sysctl -w net.core.rmem_default=56623104
sysctl -w net.core.wmem_default=56623104
sysctl -w net.core.optmem_max=56623104
sysctl -w net.ipv4.tcp_rmem='4096 87380 56623104'
sysctl -w net.ipv4.tcp_wmem='4096 65536 56623104'
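These sysctl changes are runtime-only; to persist them across reboots, they could be written to a drop-in file (the file name below is just an example):

cat > /etc/sysctl.d/90-drbd-tcp.conf <<'EOF'
net.ipv4.tcp_slow_start_after_idle = 0
net.core.rmem_max = 56623104
net.core.wmem_max = 56623104
net.core.rmem_default = 56623104
net.core.wmem_default = 56623104
net.core.optmem_max = 56623104
net.ipv4.tcp_rmem = 4096 87380 56623104
net.ipv4.tcp_wmem = 4096 65536 56623104
EOF
sysctl --system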

After applying the kernel tuning, it was possible to obtain maximum throughput from the DRBD devices:

The last graph shows that the TCP memory utilized is far from reaching the custom values previously applied to the kernel.

Random write tests did not experience any drop during execution and reached an absolute result of around 170K IOPS.

DRBD Tuning

MTU

Increasing the MTU to 9000 did not bring any improvement. As per the DRBD User's Guide, it is possible to get an idea of the latency difference between the DRBD device and the NVMe device using a single-threaded dd command:

#  dd if=/dev/zero of=/dev/drbd1 bs=4k count=100000 oflag=direct
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 5.20906 s, 78.6 MB/s
# drbdadm down r0
#  dd if=/dev/zero of=/dev/nvme0n1 bs=4k count=100000 oflag=direct
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 1.15903 s, 353 MB/s
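In per-request terms, this works out to roughly 5.21 s ÷ 100,000 ≈ 52 µs per 4K write through the DRBD device versus 1.16 s ÷ 100,000 ≈ 12 µs directly on the NVMe drive, that is, about 40 µs of added latency per synchronously replicated write.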

CPU Mask

Because it was possible to observe CPU0 being used more than the other CPUs, it seemed reasonable to allocate 24 of the 64 available CPUs (with HT enabled) to the DRBD resource:

  options {
    cpu-mask FFFFFF;
    ...
  }
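The cpu-mask value is a hexadecimal CPU bitmask; FFFFFF sets the lowest 24 bits, which pins DRBD's kernel threads to CPUs 0 through 23.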

During the write workload, it is possible to observe the utilization of only one CPU, probably the one used by the DRBD resource:

After setting the CPU mask to use the first 24 CPUs and letting the Fio jobs use the others (cpus_allowed=24-63), the CPU consumption seems better distributed:

Finally, the same CPUs masked for DRBD were also used by the Fio jobs:

In both cases, no improvement in the random test results was observed.

Read Balancing

Read tests were run again after enabling the read balancing feature in “round-robin“ mode across multiple paths:

  [...]
  disk {
    read-balancing round-robin;
  }
  [...]
  connection {
    path {
      host "idnode10.e4red" address 192.168.100.110:7900;
      host "idnode12.e4red" address 192.168.100.112:7900;
    }
    path {
      host "idnode10.e4red" address 192.168.100.111:7900;
      host "idnode12.e4red" address 192.168.100.113:7900;
    }
  }
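Assuming the resource is named r0 as earlier, a disk option change like this can typically be applied on the fly, without downtime:

drbdadm adjust r0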

As expected, enabling read balancing improved the sequential read transfer speed, by approximately 40%, but a reduction to around 190K IOPS was observed with the random read workload:

External Metadata

Random tests were run again with an external metadata device, to verify whether this could improve the test results:

[...]
  meta-disk /dev/nvme1n1;
[...]
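Moving from internal to external metadata is more than a configuration edit, because the metadata has to be created on the new device. A sketch of the steps, assuming resource r0 and leaving aside how existing data and resynchronization are handled:

drbdadm down r0
drbdadm create-md r0    # writes new metadata to the configured meta-disk, /dev/nvme1n1
drbdadm up r0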

While using the external metadata device nvme1n1, we can observe that reads occur only on the DRBD backing disk device, and performance seems pretty much the same as without external metadata. Only during the write workload was it possible to notice some writes happening on the external metadata device, with a slight decrease in latency values:

(Graphs: external metadata, followed by internal metadata.)

Scaling Random Performances

Further tests were carried out in order to understand why there is such a performance degradation while executing random workloads. During these tests, the focus was on a single NVMe device resource: first the KIOXIA KCD6 and then the Scaleflux CSDS drives. The following DRBD configuration was in place:

resource "r0" {
  device /dev/drbd1;
  disk "/dev/nvme0n1";
  meta-disk internal;
  protocol C;

  net {
    transport "tcp";
    load-balance-paths yes;
    rcvbuf-size 10M;
    sndbuf-size 10M;
    max-buffers 20k;
  }

  on "idnode10.e4red" {
    node-id    10;
  }
  on "idnode12.e4red" {
    node-id    12;
  }
  connection {
    path {
      host idnode10.e4red address 192.168.100.110:7000;
      host idnode12.e4red address 192.168.100.112:7000;
    }
    path {
      host idnode10.e4red address 192.168.100.111:7000;
      host idnode12.e4red address 192.168.100.113:7000;
    }
  }
}
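For completeness, a resource defined this way would typically be initialized and promoted with a sequence along these lines (a sketch; the forced promotion triggers the initial synchronization):

# on both nodes
drbdadm create-md r0
drbdadm up r0
# on the node used as primary for the tests
drbdadm primary --force r0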

The following Fio tests were used, executing random read and write workloads scaling from 1 to 64 workers per Fio job:

[global]
description=DRBD random
ioengine=libaio
time_based=1
startdelay=60
ramp_time=60
runtime=300
direct=1

📝 NOTE: The filename property changes according to the device under test:

  • drbd1 based on the KIOXIA KCD61VUL3T20
  • drbd2 based on the Scaleflux CSDU5SPC38
  • nvme0n1 the actual KIOXIA KCD61VUL3T20
  • nvme2n1 the actual Scaleflux CSDU5SPC38
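Combining the global section above with these devices, one of the scaling job definitions might look like the following sketch (the job name and group_reporting are assumptions; each worker uses a queue depth of 16, as discussed below):

[drbd1_randread_32w]
filename=/dev/drbd1
readwrite=randread
bs=4k
iodepth=16
numjobs=32
group_reporting=1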

All the graphs that follow report the results for reads (below 0 on the x axis) and writes (above 0 on the x axis) in sequence. Each of the Fio jobs lasts five minutes, in the following order: 1, 8, 16, 32, 48, and 64 workers.

KIOXIA KCD61VUL3T20

In the first image below, the graphs were generated while running the tests on the bare NVMe device; in the second image, the graphs were generated from tests run on the DRBD device on top of that NVMe drive:

Reads

In the graphs above, it is possible to see that in the first five minutes of the single-worker Fio job, the IOPS for the DRBD device are similar to those of the bare NVMe drive. A noticeable difference starts to appear when the number of workers increases to 8, in the second five-minute frame: the NVMe drive exploits its parallelization capabilities and handles the multiple-worker Fio jobs at a constant IOPS rate. A final observation: the DRBD device seems to perform slightly better, adding around 50K IOPS while still handling eight workers, and stabilizes for the rest of the Fio jobs at 16, 32, 48, and 64 workers.

Looking at latency, two things are striking in the graphs above. The first is that the second graph shows the DRBD device latency increasing as the number of Fio workers increases. The second is that the NVMe drive latency grows more than the DRBD device latency while providing higher IOPS to more workers, reaching above 1ms. The NVMe drive reaches maximum IOPS between 11:50 and 11:55 while the DRBD device reaches its maximum between 10:05 and 10:10, both while servicing eight Fio workers. Comparing the I/O latency at those points, the NVMe drive achieves 159us and the DRBD device 125us:

When testing the NVMe drive, more I/O requests waiting in the queue are a consequence of increasing the number of Fio workers, each worker being configured with a queue depth of 16. The queued I/O is in line with the expected queue values: 1024 for 64 workers (64 × 16), 768 for 48, and so on. For the DRBD device, this rule holds only for the single-worker job (remember that IOPS and latency are at the same values as for the NVMe drive at this stage); beyond that, the queue size increases at a much slower pace.

Finally, the second graph shows the DRBD device I/O queue increasing after the single-worker Fio job at a slower pace, eventually becoming larger than the NVMe drive I/O queue.

The second graph clearly shows the impact of the multiple-worker jobs on the DRBD device. The node was dedicated to running only these Fio tests and the DRBD engine. Because the graph on top does not show such CPU utilization, this suggests that the CPU system and IRQ servicing time is mostly caused by DRBD.

Writes

As shown in the top graph, the NVMe drive performance can reach peaks of 390K IOPS, while a more realistic and consistent number is about 170K IOPS. The DRBD device could reach its maximum performance while running the 8-, 16-, and 32-worker Fio jobs. This is perfectly in line with the NVMe drive's consistent performance (and vendor spec sheet values). As with the reads, using 48 and 64 workers, the performance drops to about 130K IOPS.

Looking at the latency graphs, it becomes clear that the latency comes from the DRBD device and not the underlying storage; here, the synchronous replication must be taken into account. As with the IOPS, the DRBD device maintains a very stable and consistent flat line that rises as the Fio worker count grows. Because the NVMe drive and DRBD device IOPS are the same in the second and third graphs below, the increase in latency starting at 11:00 (48 and 64 workers) for the NVMe drive could be the reason why there is a drop in IOPS on the DRBD device.

In contrast with the read workloads, the queue size of the DRBD device is almost identical to that of the NVMe drive.

Also, the impact on the CPU is negligible during writes as opposed to read workloads.

Conclusion

The tests showed that DRBD is an efficient replication engine even when using NVMe drives. Although the tests were carried out with only a single NVMe device, it was possible to reach its maximum transfer speed with negligible CPU consumption. Configurations with more NVMe drives, as standalone resources or aggregated via specialized RAID engines, will be part of the next tests.

Thanks to the read balancing feature, it was even possible to almost double the performance of the sequential read workload. This feature, however, is not recommended for random workloads.

Random writes performed as well as the drive spec sheet indicates, with a slight drop when running Fio jobs with 48 and 64 workers.

Although it was not possible to reach the drive's maximum IOPS during the random read workload at higher Fio worker counts, the DRBD device latency remained below 400us with 64 workers. The high queuing of the DRBD device, in contrast with the underlying NVMe drive, suggests that this behavior is a consequence of the DRBD engine design.

CPU masking is probably a feature to explore further when testing multiple drives or real-life workloads.

DRBD Functionality

DRBD was very easy to implement and configure for a single resource. Its CLI is easy to grasp, and once you have a good overview of the overall features, it is not hard to apply changes to the drbd.conf files, thanks to the excellent man pages made available.

Further testing needs to be carried out employing RDMA, either by using the native Linux OFED drivers or by installing MOFED and recompiling DRBD.

The Prometheus Node Exporter includes a DRBD collector; however, as of a few DRBD versions ago, statistics are no longer populated in /proc/drbd, so the exporter metrics are empty. To solve this, it is possible to install the DRBD Reactor Prometheus plugin.
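As a sketch of what enabling that plugin might look like: drbd-reactor is configured through TOML snippets, and the file path, option name, and default port below are assumptions to be checked against the drbd-reactor documentation:

# /etc/drbd-reactor.d/prometheus.toml (path is an assumption)
[[prometheus]]
address = "0.0.0.0:9942"   # metrics endpoint for Prometheus to scrape

A Prometheus scrape job pointed at that address then provides the DRBD metrics that the Node Exporter can no longer collect.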

Other features such as disaster recovery, trim, device verification and traffic integrity settings are yet to be explored.
