
RDMA Performance

As promised in the previous RDMA post, we gathered some performance data for the RDMA transport. Read and enjoy!

Basic hardware information:

  • Two IBM 8247-22L’s (Power8, 2 sockets * 10 CPUs, hyperthreading turned off)
  • 128GiByte RAM
  • ConnectX-4 Infiniband, two connections with 100Gbit each
  • The DRBD TCP connection was run across one “bnx2x” 10Gbit adapter pair (ie. one in each server, no bonding)
  • dm-zero, as we don’t have fast enough storage available; directly on the hardware no IO scheduler was used, and within the VM we switched to “noop”.
    NOTE: if you’d like to see performance data with real, persistent storage being used, check out our newest Tech Guide – “DRBD9 on an Ultrastar SN150 NVMe SSD“.

Software we used:

  • Ubuntu Xenial (not officially released at the time of testing)
  • Linux Kernel version 4.4.0-15-generic (ppc64el)
  • DRBD 9.0.1-1 (ded61af75823)
  • DRBD RDMA transport 2.0.0
  • fio version 2.2.10-1

Our underlying block devices were built to have some “persistent” (ha!) space at the beginning and the end, to keep the DRBD and filesystem superblocks; the rest in the middle was mapped to dm-zero:

zero-block-1: 0 8192 linear 1:1 0
zero-block-1: 8192 2147483648 zero
zero-block-1: 2147491840 8192 linear 1:1 8192
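A table like the one above can be loaded with dmsetup, roughly like this (a sketch; it requires root, and “1:1” stands for the major:minor numbers of the backing device, exactly as in the table):

```shell
# Create a mostly-zero device: 8192 real sectors at the start and end
# (for the DRBD and filesystem superblocks), dm-zero in the middle.
dmsetup create zero-block-1 <<'EOF'
0 8192 linear 1:1 0
8192 2147483648 zero
2147491840 8192 linear 1:1 8192
EOF
```

Note how the offsets line up: the zero segment starts at sector 8192 and runs for 2147483648 sectors, so the trailing linear segment begins at 8192 + 2147483648 = 2147491840.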

Due to the large number of variables and potential test cases, we restricted fio’s run-time to 10 seconds (as that should be good enough for statistical purposes, see below).

The graphics only show data for a single thread (but for multiple IO-depths), for ease of reading.
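A fio invocation along these lines produces such a matrix (a sketch – the exact job parameters weren’t published, so the device name and flags here are illustrative):

```shell
# Illustrative fio run: 10 seconds of direct random writes against the
# DRBD device, one job; block size and io-depth were varied per test.
fio --name=drbd-rdma-test --filename=/dev/drbd0 \
    --rw=randwrite --ioengine=libaio --direct=1 \
    --bs=4k --iodepth=8 --numjobs=1 \
    --runtime=10 --time_based --group_reporting
```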

For the performance point – here is fio directly on the hardware (i.e. without a virtualization layer in between).

DRBD9, connected via RDMA over 100Gbit IB, writing to dm-zero

This graphic shows a few points that should be highlighted:

  • For small block sizes (4kiB & 16kiB), the single-threaded, io-depth=1 performance is about 10k IOPs (that, times 10 seconds, amounts to 100,000 measurements – a nice statistical base, I believe ;)); with io-depth=2 it’s 20k IOPs, and with io-depth of 8 or higher we reach the top performance of ~48k IOPs.
  • For large block sizes, the best bandwidth result is a tad above 11GiB/sec (sic!)
  • Last, but not least, the best latency was below 40µsec! For two threads, io-depth=2, 4KiB block size we had this result:
      lat (usec): min=39, max=4038, avg=97.44, stdev=36.22
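Two quick sanity checks on these numbers (the fio figures are from above; the arithmetic itself is just Little’s law and a unit conversion):

```shell
# Little's law: in-flight requests = IOPs x average latency.
# With 2 threads at io-depth=2 (4 requests in flight) and the measured
# 97.44 usec average latency, the implied throughput is consistent with
# the roughly 10k IOPs per in-flight request seen above:
awk 'BEGIN { printf "implied IOPs: %.0f\n", 2 * 2 / 97.44e-6 }'

# And 100Gbit/s is 12.5e9 bytes/s, i.e. about 11.6 GiB/s - so the
# measured "tad above 11GiB/sec" is very close to line speed:
awk 'BEGIN { printf "line rate: %.1f GiB/s\n", 100e9 / 8 / 2^30 }'
```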

As a small aside, here’s the same setup, but using TCP instead of RDMA; we kept the same scale to make comparison easier.

DRBD9, connected via TCP on 10Gbit, writing to dm-zero

As you can see, copying data around isn’t that efficient – TCP is clearly slower, topping out at 1.1GiB/sec on this hardware. (But I have to admit, apart from tcp_rmem and tcp_wmem I didn’t do any tuning here either).
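For reference, that tuning amounts to something like the following (the buffer sizes shown are illustrative assumptions, not the exact values used in the test):

```shell
# Enlarge the TCP socket buffers (min/default/max, in bytes) so a
# single connection can keep a 10Gbit link busy.
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
```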

Now, we move on to results from within a VM; let’s start with reading.

The VM sees the DRBD device as /dev/sdb; the scheduler was set to “noop” to not interfere with the read IOs.
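Inside the guest that is a one-liner (assuming the DRBD device really shows up as /dev/sdb):

```shell
# Select the no-op scheduler for the DRBD-backed disk in the VM.
echo noop > /sys/block/sdb/queue/scheduler
```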

Reading in a VM, DRBD handled in Hypervisor

Here we get quite nice results, too:

  • 3.2GiB/sec, within the VM, should be “good enough” for most purposes, right?
  • ~20k IOPs at some io-depths, and even the 3.5k IOPs with sequential IO is better than using harddisks on bare hardware.

Our next milestone is writing…

Write requests have additional constraints (compared to reading) – every single write request done in the VM has to be replicated (and confirmed) in DRBD in the Hypervisor before the okay is relayed to the VM’s application.

Writing from VM, DRBD in Hypervisor

The most visible difference is the bandwidth – it tops out at ~1.1GiB/sec.

Now, these bandwidths were measured in a hyperconverged setup – the host running the VM has a copy of the data available locally. As that might not always be the case, I detached this LV, and tested again.

So, if the hypervisor does not have local storage (but always has to ask some other node), we get these pictures:

Reading within a VM, remote storage only

Writing from a VM, remote storage only

As we can see, the results are mostly the same – apart from a bit of noise, the limiting factor here is the virtualization bottleneck, not the storage transport.

The only thing left now is a summary and conclusion…

  • We lack the storage speed in our test setup (again: for performance data with real, persistent storage being used, check out our newest Tech Guide – “DRBD9 on an Ultrastar SN150 NVMe SSD“):
    Even now, without multi-queue capable DRBD, we can already utilize the full 100Gbit Infiniband RDMA bandwidth, and every performance optimization will only move the parallelism and block sizes needed to reach line speed down to more common values.
  • VM performance is probably acceptable already
    If you need performance above the now available range (3.2GiB/sec reading, 1.1GiB/sec writing), you’ll want to put your workload on hardware anyway.
  • Might get much faster still, by using DRBD within the VM and thus removing the virtualization delay.
    As the 4.4 kernel we used does not yet support SR-IOV for the ConnectX-4 cards, we couldn’t test that yet (support for SR-IOV should be in the 4.5 series, though). In theory this should give approximately the same speed in the VM as on hardware, as the OS running in the VM should be able to read and write data directly to/from the remote storage nodes…

I guess we’ll need to do another follow-up in this series later on … 😉

We now have the Tech Guide for RDMA performance with non-volatile storage available online.

Just head over to the LINBIT Tech Guide area and read the HGST Ultrastar SN150 NVMe performance report! (Free registration required.)

Questions? Contact!
