As promised in the previous RDMA post, we gathered some performance data for the RDMA transport. Read and enjoy!
Basic Hardware Information
- Two IBM 8247-22L’s (Power8, 2 sockets * 10 CPUs, hyperthreading turned off)
- 128GiByte RAM
- ConnectX-4 Infiniband, two connections with 100Gbit each
- The DRBD® TCP connection was run across one “bnx2x” 10Gbit adapter pair (i.e. one in each server, no bonding)
Our backing storage was dm-zero, as we don’t have fast enough storage available. There was no IO scheduler directly on the hardware; within the VM we switched to ‘noop’.
NOTE: if you’d like to see performance data with actual, persistent storage being used, check out our newest Tech Guide – DRBD9 on an Ultrastar SN150 NVMe SSD.
The Software We Used
- Ubuntu Xenial (not officially released at the time of testing)
- Linux Kernel version 4.4.0-15-generic (ppc64el)
- DRBD 9.0.1-1 (ded61af75823)
- DRBD RDMA transport 2.0.0
Our underlying block devices have some ‘persistent’ (ha!) space at the beginning and the end, to keep the DRBD and filesystem superblocks; the rest in the middle was mapped to dm-zero:
zero-block-1: 0 8192 linear 1:1 0
zero-block-1: 8192 2147483648 zero
zero-block-1: 2147491840 8192 linear 1:1 8192
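Such a mapping could be built with dmsetup roughly as sketched below (shown as comments, since it needs root and the dm-zero target; “1:1” is the major:minor of the device holding the persistent head and tail, so adjust for your system). The arithmetic checks that the segments line up; all numbers are 512-byte sectors.

```shell
# Recreating the mapping (needs root, dm-zero target available):
#
# dmsetup create zero-block-1 <<'EOF'
# 0 8192 linear 1:1 0
# 8192 2147483648 zero
# 2147491840 8192 linear 1:1 8192
# EOF
#
# Sanity-check: the tail mapping must start where the zero segment ends.
tail_offset=$((8192 + 2147483648))
echo "$tail_offset"                                # 2147491840 sectors
# Size of the zero segment in GiB (sectors * 512 bytes):
echo $((2147483648 * 512 / 1024 / 1024 / 1024))    # 1024, i.e. a 1 TiB middle
```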
Due to the many variables and potential test cases, we restricted fio’s run-time to 10 seconds (as that should be good enough for statistical purposes, see below).
The graphics only show data for a single thread (but for multiple IO depths), for ease of reading.
For the first performance point, here is fio run directly on the hardware (i.e. without a virtualization layer in between).
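We don’t reproduce the exact job file here; a representative fio job along these lines (device name and parameter values are illustrative, and block size, io-depth, and thread count were varied across the test matrix) would look like:

```ini
; illustrative fio job -- /dev/drbd0, bs, iodepth and numjobs are assumptions
[drbd-rdma-bench]
filename=/dev/drbd0
direct=1
ioengine=libaio
rw=randwrite
bs=4k
iodepth=8
numjobs=1
runtime=10
time_based
group_reporting
```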
DRBD9, connected via RDMA over 100Gbit IB, writing to dm-zero
This graphic shows a few points worth highlighting:
- For small block sizes (4kiB & 16kiB), the single-threaded/io-depth=1 performance is about 10k IOPs (that, times 10 seconds, amounts to 100,000 measurements: an excellent statistical base); with io-depth=2 it’s 20k IOPs, and when io-depth is 8 or higher, we reach the top performance of ~48k IOPs.
- For large block sizes, the best bandwidth result is a tad above 11GiB/sec (sic!)
- Last but not least, the best latency was below 40µsec! For two threads, io-depth=2, 4KiB block size, we had this result:
lat (usec): min=39, max=4038, avg=97.44, stdev=36.22
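These numbers are internally consistent: by Little’s law, one thread at io-depth=2 keeps roughly 2 requests outstanding, and dividing that by the ~97µs average latency lands near the ~20k IOPs measured at io-depth=2; likewise, 11GiB/sec sits close to the 100Gbit line rate. A quick back-of-the-envelope check (values taken from the results above):

```shell
# Little's law: IOPS ~= outstanding requests / average latency.
awk 'BEGIN { printf "%.0f IOPs\n", 2 / 97.44e-6 }'
# -> about 20525 IOPs, matching the ~20k IOPs at io-depth=2

# 11 GiB/s expressed in Gbit/s (GiB -> bytes -> bits -> decimal gigabits):
awk 'BEGIN { printf "%.1f Gbit/s\n", 11 * 2^30 * 8 / 1e9 }'
# -> about 94.5 Gbit/s, i.e. close to the 100Gbit line rate
```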
Using TCP Instead Of RDMA
As a slight comparison, here’s the same setup, but using TCP instead of RDMA; we kept the same scale to make comparison easier.
DRBD9, connected via TCP on 10Gbit, writing to dm-zero
As you can see, copying data around isn’t efficient – TCP is slower, topping out at 1.1GiB/sec on this hardware. But I have to admit, apart from tcp_rmem and tcp_wmem, I didn’t do any tuning here either.
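For reference, tcp_rmem and tcp_wmem are set via sysctl; a tuning fragment would look something like the following (the exact values we used are not recorded here — these are merely plausible buffer sizes for a 10Gbit link):

```ini
# /etc/sysctl.d/90-drbd-tcp.conf -- illustrative values only
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
```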
Now, we move on to results from within a VM; let’s start with reading.
The VM sees the DRBD device as /dev/sdb; we set the scheduler to ‘noop’ to not interfere with the read IOs.
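Inside the VM, switching the scheduler is a one-liner against sysfs (the path assumes the 4.4 kernel’s single-queue block layer; the guard keeps the snippet harmless on machines without such a device):

```shell
# Set the elevator for /dev/sdb to 'noop' (needs root).
sched=/sys/block/sdb/queue/scheduler
if [ -w "$sched" ]; then
    echo noop > "$sched"
    cat "$sched"    # the active scheduler is shown in brackets
fi
```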
Reading In A VM, DRBD Handled In Hypervisor
Here we get pretty positive results, too:
- 3.2GiB/sec, within the VM, should be “good enough” for most purposes, right?
- ~20k IOPs at higher io-depths, and still 3.5k IOPs with sequential IO, is better than hard disks on bare hardware.
Our next milestone is writing. Write requests have additional constraints (compared to reading) – every single write request done in the VM has to be replicated (and confirmed) in DRBD in the Hypervisor before the okay is relayed to the VM’s application.
Writing From VM, DRBD In Hypervisor
The most visible difference is the bandwidth – it tops out at ~1.1GiB/sec.
Now, we measured these bandwidths in a hyper-converged setup. The host running the VM has a copy of the data available locally. As that might not always be the case, I detached this LV and tested again.
So, if the hypervisor does not have local storage (but always has to ask some other node), we get these pictures:
Reading within a VM, remote storage only
Writing from a VM, remote storage only
As we can see, the results are mostly the same – apart from a bit of noise, the limiting factor here is the virtualization bottleneck, not the storage transport.
The only thing left now is to summarize our findings.
- We lack the storage speed in our test setup (for performance data with actual, persistent storage, see our Tech Guide “DRBD9 on an Ultrastar SN150 NVMe SSD”):
Even now, without multi-queue capable DRBD, we can already utilize the total 100Gbit Infiniband RDMA bandwidth. Every performance optimization will only move the parallelism and block sizes needed to reach line speed toward more typical values.
- VM performance is probably acceptable already
If you need performance above the available range (3.2GiB/sec reading, 1.1GiB/sec writing), you’ll want to put your workload on hardware anyway.
- It might still get faster by running DRBD within the VM itself, thus removing the virtualization delay.
As the 4.4 kernel we used does not yet support SR-IOV for the ConnectX-4 cards, we couldn’t test that yet (support for SR-IOV should be in the 4.5 series, though). In theory, this should give approximately the same speed in the VM as on hardware, as the OS running in the VM should be able to read and write data directly to/from the remote storage nodes.
We may have to follow up soon. In the meantime, the Tech Guide for RDMA performance with non-volatile storage is available online. Head to the LINBIT® Tech Guide area and read the HGST Ultrastar SN150 NVMe performance report! (Free registration required.)
If you have any questions about RDMA Performance or anything else, don’t hesitate to get in touch.