LINBIT® has been building and supporting high-availability (HA) iSCSI clusters using DRBD® and Pacemaker for over a decade. In fact, that was the very first HA cluster I built for a client when I started working at LINBIT as a support engineer back in 2014. Searching the Internet for HA iSCSI Pacemaker clusters will return a lot of results, but none of them will show you how you can configure multipathing for iSCSI when using Pacemaker, which is why I’m writing this blog today.
❗ IMPORTANT: This blog is not describing using DRBD in dual-primary mode to multipath a client (initiator) system to separate DRBD nodes. You should never try to do that. This blog is describing multipathing between an initiator and the iSCSI cluster’s single DRBD primary node. I mention this because that subject comes up A LOT.
In case you’re uninitiated, iSCSI multipathing is a technology that enables redundant and load-balanced connections between an iSCSI initiator and an iSCSI target. It allows multiple network paths to be used simultaneously for data transmission, bolstering both fault tolerance and performance. An HA iSCSI cluster without multipathing can keep your data available when there are failures on the target servers, but this does nothing for a dead switch between the target and initiator, if that switch is the only connection between them.
The Pacemaker Configuration
A basic HA iSCSI Pacemaker configuration as outlined in LINBIT’s Highly Available iSCSI Storage With DRBD And Pacemaker On RHEL 8 how-to guide will have you configure a single virtual IP (VIP) address that floats between the peers that make up the HA cluster. A client can use the VIP to attach to the active node in the cluster. This basic configuration also uses a Pacemaker resource group to order and colocate all the different primitives needed to create an HA iSCSI target.
The main differences to note in a Pacemaker configuration that can support multipathing are a second (or n-th) VIP address, and all of the VIP addresses’ iSCSI sockets listed in the portals parameter on the iSCSITarget primitive. Optionally, “long form” resource colocation and ordering constraints can be used along with a resource set to logically “group” the VIP addresses. The resource set’s “grouping” of the VIP addresses will start the VIP addresses in parallel, but only require one to fully start before Pacemaker continues to start services according to their ordering constraints. I say the long form ordering is optional because you technically can put all of your VIP addresses into a resource group, but a group starts its members sequentially, which isn’t as efficient. Also, resource sets in Pacemaker are fairly niche, at least in my experience, so maybe this will help someone searching the internet for examples in both crmsh and pcs syntax (the crmsh documentation covers resource sets at https://crmsh.github.io/man-2.0/#topics_Features_Resourcesets).
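For comparison, a resource group based alternative might look something like the sketch below in crmsh syntax. The resource names mirror the configuration shown later in this post, but this is only an illustration of the sequential approach, not the configuration this post recommends:
group g_iscsi p_iscsi_portblock_on_0 p_vip_0a p_vip_0b p_iscsi_target_0 p_iscsi_lun_0 p_iscsi_portblock_off_0
colocation cl_g_iscsi-with-ms_drbd_r0 inf: g_iscsi ms_drbd_r0:Master
order o_ms_drbd_r0-before-g_iscsi ms_drbd_r0:promote g_iscsi
Because a group implies both ordering and colocation among its members, p_vip_0a would have to finish starting before p_vip_0b even begins, which is exactly the sequential behavior the resource set approach avoids.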
📝 NOTE: Each of the multiple VIP addresses should exist on completely separate networks that do not share any single points of failure. In the case of a complete server failure, Pacemaker and DRBD will allow services to automatically fail over to the peer server.
The network interfaces on the iSCSI cluster that will be used for the iSCSI initiator and target traffic, and where the VIP addresses will be assigned, are as follows:
# ip addr show enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:3f:44:ef brd ff:ff:ff:ff:ff:ff
inet 192.168.222.32/24 brd 192.168.222.255 scope global noprefixroute enp0s8
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fe3f:44ef/64 scope link
valid_lft forever preferred_lft forever
# ip addr show enp0s9
4: enp0s9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:d1:a5:bc brd ff:ff:ff:ff:ff:ff
inet 192.168.221.32/24 brd 192.168.221.255 scope global noprefixroute enp0s9
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fed1:a5bc/64 scope link
valid_lft forever preferred_lft forever
Here is the HA iSCSI cluster configuration that supports multipathing in crmsh syntax:
primitive p_drbd_r0 ocf:linbit:drbd \
params drbd_resource=r0 \
op start interval=0s timeout=240 \
op promote interval=0s timeout=90 \
op demote interval=0s timeout=90 \
op stop interval=0s timeout=100 \
op monitor interval=29 role=Master \
op monitor interval=31 role=Slave
primitive p_iscsi_lun_0 iSCSILogicalUnit \
params target_iqn="iqn.2017-01.com.linbit:drbd0" implementation=lio-t \
scsi_sn=aaaaaaa0 lio_iblock=0 lun=0 path="/dev/drbd0" \
op start interval=0 timeout=20 \
op stop interval=0 timeout=20 \
op monitor interval=20 timeout=40
primitive p_iscsi_portblock_off_0 portblock \
params portno=3260 protocol=tcp action=unblock \
op start timeout=20 interval=0 \
op stop timeout=20 interval=0 \
op monitor timeout=20 interval=20
primitive p_iscsi_portblock_on_0 portblock \
params portno=3260 protocol=tcp action=block \
op start timeout=20 interval=0 \
op stop timeout=20 interval=0 \
op monitor timeout=20 interval=20
primitive p_iscsi_target_0 iSCSITarget \
params iqn="iqn.2017-01.com.linbit:drbd0" implementation=lio-t \
portals="192.168.222.35:3260 192.168.221.35:3260" \
op start interval=0 timeout=20 \
op stop interval=0 timeout=20 \
op monitor interval=20 timeout=40
primitive p_vip_0a IPaddr2 \
params ip=192.168.222.35 cidr_netmask=24 \
op start interval=0 timeout=20 \
op stop interval=0 timeout=20 \
op monitor interval=10s
primitive p_vip_0b IPaddr2 \
params ip=192.168.221.35 cidr_netmask=24 \
op start interval=0 timeout=20 \
op stop interval=0 timeout=20 \
op monitor interval=10s
ms ms_drbd_r0 p_drbd_r0 \
meta master-max=1 master-node-max=1 notify=true clone-max=3 clone-node-max=1
colocation cl_p_iscsi_lun_0-with_p_iscsi_target_0 inf: p_iscsi_lun_0 p_iscsi_target_0
colocation cl_p_iscsi_portblock_off_0-with-p_iscsi_lun_0 inf: p_iscsi_portblock_off_0 p_iscsi_lun_0
colocation cl_p_iscsi_portblock_on_0-with-ms_drbd_r0 inf: p_iscsi_portblock_on_0 ms_drbd_r0:Master
colocation cl_p_iscsi_target_0-with-p_vip_0a inf: p_iscsi_target_0 [ p_vip_0a p_vip_0b ]
colocation cl_p_vips-with-p_iscsi_portblock_on_0 inf: [ p_vip_0a p_vip_0b ] p_iscsi_portblock_on_0
order o_ms_drbd_r0-before_p_iscsi_portblock_on_0 ms_drbd_r0:promote p_iscsi_portblock_on_0
order o_p_iscsi_lun_0-before_p_iscsi_portblock_off_0 p_iscsi_lun_0 p_iscsi_portblock_off_0
order o_p_iscsi_portblock_on_0-before_p_vip_0a p_iscsi_portblock_on_0 [ p_vip_0a p_vip_0b ]
order o_p_iscsi_target_0-before_p_iscsi_lun_0 p_iscsi_target_0 p_iscsi_lun_0
order o_p_vips-before-p_iscsi_target_0 [ p_vip_0a p_vip_0b ] p_iscsi_target_0
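If you want to apply a configuration such as this in one step, one option is to save it to a file and load it with crmsh, then review the result. A minimal sketch, with the file name being only a placeholder:
# crm configure load update ha-iscsi-multipath.crm
# crm configure show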
Or, if for some reason you need more angle brackets in your life, here is the same configuration in pcs XML syntax:
<cib [...]>
<configuration>
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair name="stop-all-resources" value="false" id="cib-bootstrap-options-stop-all-resources"/>
<nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
</cluster_property_set>
</crm_config>
<nodes>
<node id="3" uname="iscsi-2"/>
[...]
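If you manage the cluster with pcs, one way to work with this XML directly is to dump the CIB to a file, make your changes, and push it back. A rough sketch, with the file name as a placeholder:
# pcs cluster cib > ha-iscsi.xml
# vi ha-iscsi.xml
# pcs cluster cib-push ha-iscsi.xml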
With the above configuration set in Pacemaker, the cluster monitor output will look like this:
[root@iscsi-2 ~]# crm_mon -1r
Cluster Summary:
* Stack: corosync
* Current DC: iscsi-2 (version 2.0.5.linbit-1.0.el8-ba59be712) - partition with quorum
* Last updated: Tue Oct 3 18:46:03 2023
* Last change: Sat Sep 30 06:18:10 2023 by hacluster via crmd on iscsi-1
* 3 nodes configured
* 9 resource instances configured
Node List:
* Online: [ iscsi-0 iscsi-1 iscsi-2 ]
Full List of Resources:
* p_iscsi_lun_0 (ocf::heartbeat:iSCSILogicalUnit): Started iscsi-2
* p_iscsi_portblock_off_0 (ocf::heartbeat:portblock): Started iscsi-2
* p_iscsi_portblock_on_0 (ocf::heartbeat:portblock): Started iscsi-2
* p_iscsi_target_0 (ocf::heartbeat:iSCSITarget): Started iscsi-2
* p_vip_0a (ocf::heartbeat:IPaddr2): Started iscsi-2
* p_vip_0b (ocf::heartbeat:IPaddr2): Started iscsi-2
* Clone Set: ms_drbd_r0 [p_drbd_r0] (promotable):
* Masters: [ iscsi-2 ]
* Slaves: [ iscsi-0 iscsi-1 ]
Notice the iSCSI target is currently running on the host named iscsi-2. As shown in the Pacemaker configuration, the p_vip_0a and p_vip_0b VIP resources are configured with the IP addresses 192.168.222.35/24 and 192.168.221.35/24, respectively. Those are the IP addresses the iSCSI target is listening on.
Inspecting the interfaces on iscsi-2 will show the VIP addresses assigned to their respective interfaces:
# ip addr show enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:3f:44:ef brd ff:ff:ff:ff:ff:ff
inet 192.168.222.32/24 brd 192.168.222.255 scope global noprefixroute enp0s8
valid_lft forever preferred_lft forever
inet 192.168.222.35/24 brd 192.168.222.255 scope global secondary enp0s8
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fe3f:44ef/64 scope link
valid_lft forever preferred_lft forever
# ip addr show enp0s9
4: enp0s9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:d1:a5:bc brd ff:ff:ff:ff:ff:ff
inet 192.168.221.32/24 brd 192.168.221.255 scope global noprefixroute enp0s9
valid_lft forever preferred_lft forever
inet 192.168.221.35/24 brd 192.168.221.255 scope global secondary enp0s9
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fed1:a5bc/64 scope link
valid_lft forever preferred_lft forever
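As an optional sanity check on the active node, you could also list the running LIO configuration with targetcli to confirm the target, LUN, and both portals are present. This is read-only and not required for the setup to work:
# targetcli ls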
Connecting an iSCSI Initiator to the Target Cluster
Connecting an iSCSI initiator to an iSCSI target using multipathing is as easy as connecting a non-multipathed initiator and target. As mentioned earlier, the iSCSI target and initiator systems should be connected to two or more networks that do not share components. The network interfaces configured on the initiator system look like this:
$ ip addr show eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:68:17:3d brd ff:ff:ff:ff:ff:ff
altname enp0s8
inet 192.168.222.254/24 brd 192.168.222.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fe68:173d/64 scope link
valid_lft forever preferred_lft forever
$ ip addr show eth2
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:c4:e0:19 brd ff:ff:ff:ff:ff:ff
altname enp0s9
inet 192.168.221.254/24 brd 192.168.221.255 scope global eth2
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fec4:e019/64 scope link
valid_lft forever preferred_lft forever
$ ip addr show eth3
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:ea:11:61 brd ff:ff:ff:ff:ff:ff
altname enp0s10
inet 192.168.220.254/24 brd 192.168.220.255 scope global eth3
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:feea:1161/64 scope link
valid_lft forever preferred_lft forever
Interfaces eth1 and eth2 are on the same networks as the iSCSI target cluster, and eth3 is on yet another network that is used to access services hosted on the initiator system.
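Before discovering and logging in to the target, the initiator needs the iSCSI initiator and multipath tooling installed and running. On a RHEL-based system that might look something like the following; package and service names can differ on other distributions:
$ sudo dnf install -y iscsi-initiator-utils device-mapper-multipath
$ sudo mpathconf --enable --with_multipathd y
$ sudo systemctl enable --now iscsid multipathd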
Performing an iSCSI discovery from the initiator against a single target VIP will show all available targets on all of the available paths:
$ sudo iscsiadm --mode discovery -t st -p 192.168.222.35
192.168.222.35:3260,1 iqn.2017-01.com.linbit:drbd0
192.168.221.35:3260,1 iqn.2017-01.com.linbit:drbd0
Logging in to the target through both of its portals, using the VIP addresses and ports (sockets) shown in the discovery output, will connect the initiator to the target over multiple paths:
$ sudo iscsiadm --mode node -T iqn.2017-01.com.linbit:drbd0 -p 192.168.222.35:3260 -l
Logging in to [iface: default, target: iqn.2017-01.com.linbit:drbd0, portal: 192.168.222.35,3260]
Login to [iface: default, target: iqn.2017-01.com.linbit:drbd0, portal: 192.168.222.35,3260] successful.
$ sudo iscsiadm --mode node -T iqn.2017-01.com.linbit:drbd0 -p 192.168.221.35:3260 -l
Logging in to [iface: default, target: iqn.2017-01.com.linbit:drbd0, portal: 192.168.221.35,3260]
Login to [iface: default, target: iqn.2017-01.com.linbit:drbd0, portal: 192.168.221.35,3260] successful.
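If you want these sessions to be re-established automatically when the initiator reboots, you could also set the node startup mode to automatic for each portal, for example:
$ sudo iscsiadm --mode node -T iqn.2017-01.com.linbit:drbd0 -p 192.168.222.35:3260 --op update -n node.startup -v automatic
$ sudo iscsiadm --mode node -T iqn.2017-01.com.linbit:drbd0 -p 192.168.221.35:3260 --op update -n node.startup -v automatic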
Next, you can verify that multipathing is working by using the multipath -ll command:
$ sudo multipath -ll
mpatha (36001405aaaaaaa000000000000000000) dm-1 LIO-ORG,p_iscsi_lun_0
size=8.0G features='0' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=enabled
| `- 3:0:0:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=50 status=enabled
`- 4:0:0:0 sdc 8:32 active ready running
On the cluster node where the iSCSI target is currently running, you can look at the TCP sessions and see an established session on each of the VIP addresses from the initiator’s respective IP addresses.
# ss -tn
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 192.168.222.35:3260 192.168.222.254:51856
ESTAB 0 0 192.168.221.35:3260 192.168.221.254:46682
From here, the multipath device can be used on the initiator system through device mapper. The device mapper name in this example, which can be seen in the multipath -ll output above, is mpatha, and the device can be used as /dev/mapper/mpatha or configured further depending on your needs. When a single path fails between the target and initiator, such as a switch or network interface card (NIC) failure, the initiator system will be able to continue reading from and writing to the target cluster. If the active node in the iSCSI target cluster fails, the iSCSI target will seamlessly fail over to another node in the cluster without interruption to reads or writes from the initiator, thanks to Pacemaker and DRBD.
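The friendly mpatha name comes from multipath’s user_friendly_names setting. If you would rather not rely on distribution defaults, a minimal /etc/multipath.conf might look something like this (the values shown are common defaults, not tuning advice):
defaults {
    user_friendly_names yes
    find_multipaths yes
}
After changing the file, reload the configuration with systemctl reload multipathd so the running daemon picks it up.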
The Testing Environment
Testing was done using a MinIO server on the iSCSI initiator. Link failures between the iSCSI initiator and target were simulated by using a script on the hypervisor to “unplug” and “plug in” the initiator system’s network cables one at a time. Uploads of a large ISO image were looped from another system to a MinIO bucket backed by the iSCSI cluster’s HA iSCSI target volume. When a network cable was “unplugged” between the iSCSI initiator and target cluster, there was a brief decrease in throughput on the ISO upload before multipathing marked the path as faulty, allowing all I/O to continue over the single remaining path.
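If you run similar failure tests, watching the path states from the initiator makes it easy to see multipath mark a path as faulty and then recover it once the link returns, for example:
$ watch -n1 'sudo multipath -ll'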
Closing Thoughts
The upside of using multipathing between an iSCSI target and initiator whenever possible should be pretty apparent. By combining multipathing with DRBD and Pacemaker, you can achieve even higher availability than either approach provides on its own.
Whether you stumbled across this blog while looking for resources specific to HA iSCSI multipathing in Linux or some smaller tidbit of information within it, I hope you did find it helpful. If you happen to be building a storage system using DRBD and need some pointers from the creators, don’t hesitate to reach out directly, or consider joining the LINBIT community where you can share any thoughts or questions with me and other users of LINBIT’s open source clustering software.