We often see people on #drbd or on drbd-user trying to measure the performance of their setup. Here are a few best practices for doing so.
First, a few facts.
- The synchronization rate shown in /proc/drbd has nothing to do with the replication rate. These are different things; don’t mistake the speed: value there for a performance indicator.
- Use an appropriate tool. dd with default settings and cp don’t write to the device, but only into the Linux buffers at first – so timing these won’t tell you anything about your storage performance.
- The hardest discipline is single-threaded, io-depth 1. Here every access has to wait for the preceding one to finish, so each bit of latency will bite you hard. Getting some bandwidth with four thousand concurrent writes is easy!
- Network benchmarking isn’t that easy, either. iperf will typically send only NULs; checksum offloading might hide or create problems; switches, firewalls, etc. will all introduce noise.
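To see the buffer effect from the second point for yourself, a minimal sketch along these lines may help. It assumes GNU dd and a hypothetical scratch file (./dd-scratch.img, overridable via a SCRATCH variable that is our invention) on the storage in question. It only illustrates the difference between the three modes; 64 MiB is far too little for a real measurement.

```shell
#!/bin/sh
# Hypothetical scratch file; place it on the storage you care about.
scratch="${SCRATCH:-./dd-scratch.img}"

# 1) Default dd: returns as soon as the data sits in the page cache.
buffered=$(dd if=/dev/zero of="$scratch" bs=1M count=64 2>&1 | tail -n 1)

# 2) conv=fsync: dd flushes the file to the device before printing its summary.
synced=$(dd if=/dev/zero of="$scratch" bs=1M count=64 conv=fsync 2>&1 | tail -n 1)

# 3) oflag=direct: opens with O_DIRECT and bypasses the page cache.
#    Not every filesystem supports this; the line then shows dd's error.
direct=$(dd if=/dev/zero of="$scratch" bs=1M count=64 oflag=direct 2>&1 | tail -n 1)

printf 'buffered: %s\nsynced:   %s\ndirect:   %s\n' "$buffered" "$synced" "$direct"
rm -f "$scratch"
```

On most systems the buffered line reports a rate well above what the device can sustain, which is exactly why it says nothing about the storage.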
What you want to do:
- Start at the bottom of the stack. Measure (and tune) the LV that DRBD will sit upon, then the network, then DRBD.
- Our suggestion is still to use a direct connection, i.e. a crossover cable.
- If you don’t have any data on the device, test against the block device. A filesystem on top will create additional meta-data load and barriers; this can severely affect your IOPs (especially on rotating media).
- Useful tools are fio with direct=1, and for a basic single-threaded io-depth=1 run you can use dd oflag=direct (for writes; when reading, set iflag=direct). bs=4096 is nice to measure the IOPs, bs=1M will give you the bandwidth.
- Get enough data. Running dd with settings that make it finish within 0.5 seconds means that you are likely to suffer from outliers; make it run 5 seconds or longer! fio has the nice runtime parameter; just let it run 20 seconds to have some data.
- For any unexpected result, try to measure again a minute later, then think hard about what could be wrong and where your cluster’s bottlenecks are.
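The dd and fio suggestions above could be wrapped up roughly as follows. This is a sketch, not a finished tool: the function names, the default counts and the device path in the usage comments are hypothetical, and dd_iops assumes GNU dd's English summary line. The write functions destroy whatever is on the target, so point them only at a device that holds no data. Run them first on the backing LV, then on the connected DRBD device.

```shell
#!/bin/sh
# Single-threaded, io-depth 1 runs with dd. $1 = device, $2 = block count.
# Pick the count so that each run takes at least 5 seconds.
dd_write_iops() { dd if=/dev/zero of="$1" bs=4096 count="${2:-100000}" oflag=direct; }
dd_write_bw()   { dd if=/dev/zero of="$1" bs=1M   count="${2:-4096}"   oflag=direct; }
dd_read_iops()  { dd if="$1" of=/dev/null bs=4096 count="${2:-100000}" iflag=direct; }
dd_read_bw()    { dd if="$1" of=/dev/null bs=1M   count="${2:-4096}"   iflag=direct; }

# dd reports bytes and seconds, not IOPS. This helper reads dd's summary
# line ("N bytes ... copied, S s, R MB/s") and prints N / block size / S.
dd_iops() {
    awk -v bs="$1" '/copied/ { printf "%.0f\n", $1 / bs / $(NF-3) }'
}

# The same io-depth 1 random-write test with fio, running for 20 seconds.
# --time_based (our addition) keeps fio going for the full runtime.
fio_qd1() {
    fio --name=qd1 --filename="$1" --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
        --runtime=20 --time_based
}

# Usage (hypothetical device name; dd prints its summary on stderr):
#   dd_write_iops /dev/vg0/scratch 2>&1 | dd_iops 4096
#   dd_write_bw   /dev/vg0/scratch
#   fio_qd1       /dev/vg0/scratch
```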
Some problems we’ve seen:
- Misaligned partitions (sector 63, anyone?) might hurt you plenty. Really.
If you suffer from that, get the secondary correctly aligned, switch over, and re-do the previous primary node.
- iperf goes fast, but a connected DRBD doesn’t: try turning off the offloading on the network cards; some will trash the checksum for non-zero data, and that means retransmissions.
- Some RAID controllers can be tuned – to either IOPs or bandwidth. Sounds strange, but we have seen such effects.
- Concurrent load – trying to benchmark the storage on your currently active database machine is not a good idea.
- Broken networks should be looked for even if there are no error counters on the interface. Recently a pair started to connect just fine, but then couldn’t even synchronize with a meagre 10MiByte/sec… The best hint was the ethtool output that said Speed: 10MBit; switching cables did resolve that issue.
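A few of these checks can be scripted. The sketch below uses hypothetical helper names; the default alignment of 2048 sectors (1 MiB) is an assumption based on what current partitioning tools commonly use, so adjust it to your RAID stripe size if needed. The ethtool calls need root and a real interface name.

```shell
#!/bin/sh
# Start sector of a partition (in 512-byte units), e.g. "part_start sda1".
part_start() { cat "/sys/class/block/$1/start"; }

# Succeeds if start sector $1 is a multiple of $2 sectors (default 2048).
is_aligned() {
    [ $(( $1 % ${2:-2048} )) -eq 0 ]
}

# Negotiated link speed and duplex of interface $1.
show_link() { ethtool "$1" | grep -E 'Speed|Duplex'; }

# Turn off RX/TX checksum offloading on interface $1.
disable_csum_offload() { ethtool -K "$1" rx off tx off; }

# Usage (hypothetical device and interface names):
#   is_aligned "$(part_start sda1)" || echo "sda1 is misaligned"
#   show_link eth1
#   disable_csum_offload eth1
```

The classic start sector 63 fails this check; a start sector of 2048 passes.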
If you’re doing all that correctly, and are using a recent DRBD version, for a pure random-write IO you should only see 1-3% difference between the lower-level LV directly and a connected DRBD.
Here’s an example fio call.
fio --name $name --filename $dev --ioengine libaio --direct 1 \
    --rw randwrite --bs 4k --runtime 30s --numjobs $threads \
    --iodepth $iodepth --append-terse
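To compare the lower-level LV with the connected DRBD device, one possible approach is to run the same fio call against both and compute the relative difference. The sketch below is an assumption-laden illustration: run_fio, overhead_pct, the job naming scheme and the device paths are all made up for this example, and the IOPS figures have to be read from fio's own output.

```shell
#!/bin/sh
# The fio call from above as a function: $1 = device, $2 = threads, $3 = iodepth.
run_fio() {
    fio --name="bench-t$2-qd$3" --filename="$1" --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --runtime=30s --numjobs="$2" \
        --iodepth="$3" --append-terse
}

# Relative loss in percent: $1 = IOPS on the backing LV, $2 = IOPS on DRBD.
overhead_pct() {
    awk -v base="$1" -v drbd="$2" 'BEGIN { printf "%.1f\n", (base - drbd) * 100 / base }'
}

# Usage (hypothetical paths; test the LV before DRBD is stacked on top of it,
# and only on devices that hold no data):
#   run_fio /dev/vg0/scratch 1 1
#   run_fio /dev/drbd0 1 1
#   overhead_pct 20000 19500
```

With the example figures, overhead_pct prints 2.5, which is inside the 1-3% range mentioned above.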