The DRBD9 User’s Guide
Please Read This First
This guide is intended to serve users of the Distributed Replicated Block Device version 9 (DRBD-9) as a definitive reference guide and handbook.
It is being made available to the DRBD community by LINBIT, the project’s sponsor company, free of charge and in the hope that it will be useful. The guide is constantly being updated. We try to add information about new DRBD features simultaneously with the corresponding DRBD releases. An online HTML version of this guide is always available at https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/.
This guide assumes, throughout, that you are using the latest version of DRBD and related tools. If you are using an 8.4 release of DRBD, please use the matching version of this guide from https://linbit.com/drbd-user-guide/users-guide-drbd-8-4/.
Please use the drbd-user mailing list to submit comments.
This guide is organized as follows:
-
Introduction to DRBD deals with DRBD’s basic functionality. It gives a short overview of DRBD’s positioning within the Linux I/O stack and of fundamental DRBD concepts. It also examines DRBD’s most important features in detail.
-
Building and Installing the DRBD Software covers building DRBD from source and installing pre-built DRBD packages, and gives an overview of getting DRBD running on a cluster system.
-
Working with DRBD is about managing DRBD using resource configuration files, as well as common troubleshooting scenarios.
-
DRBD-enabled Applications deals with leveraging DRBD to add storage replication and high availability to applications. It not only covers DRBD integration in the Pacemaker cluster manager, but also advanced LVM configurations, integration of DRBD with GFS, and adding high availability to Xen virtualization environments.
-
Optimizing DRBD Performance contains pointers for getting the best performance out of DRBD configurations.
-
Learning More dives into DRBD’s internals, and also contains pointers to other resources which readers of this guide may find useful.
-
Recent Changes is an overview of changes in DRBD 9.0, compared to earlier DRBD versions.
Users interested in DRBD training or support services are invited to contact us at [email protected] or [email protected].
Introduction to DRBD
1. DRBD Fundamentals
DRBD is a software-based, shared-nothing, replicated storage solution mirroring the content of block devices (hard disks, partitions, logical volumes, and so on) between hosts.
DRBD mirrors data
-
in real time. Replication occurs continuously while applications modify the data on the device.
-
transparently. Applications need not be aware that the data is stored on multiple hosts.
-
synchronously or asynchronously. With synchronous mirroring, applications are notified of write completions after the writes have been carried out on all (connected) hosts. With asynchronous mirroring, applications are notified of write completions when the writes have completed locally, which usually is before they have propagated to the other hosts.
1.1. Kernel Module
DRBD’s core functionality is implemented by way of a Linux kernel module. Specifically, DRBD constitutes a driver for a virtual block device, so DRBD is situated right near the bottom of a system’s I/O stack. Because of this, DRBD is extremely flexible and versatile, which makes it a replication solution suitable for adding high availability to just about any application.
DRBD is, by definition and as mandated by the Linux kernel architecture, agnostic of the layers above it. Therefore, it is impossible for DRBD to miraculously add features to upper layers that these do not possess. For example, DRBD cannot auto-detect file system corruption or add active-active clustering capability to file systems like ext3 or XFS.
1.2. User Space Administration Tools
DRBD includes a set of administration tools which communicate with the kernel module to configure and administer DRBD resources. From top-level to bottom-most these are:
drbdadm
The high-level administration tool of the DRBD-utils program suite. Obtains all DRBD configuration parameters from the configuration file /etc/drbd.conf and acts as a front-end for drbdsetup and drbdmeta. drbdadm has a dry-run mode, invoked with the -d option, that shows which drbdsetup and drbdmeta calls drbdadm would issue without actually calling those commands.
drbdsetup
Configures the DRBD module that was loaded into the kernel. All parameters to drbdsetup must be passed on the command line. The separation between drbdadm and drbdsetup allows for maximum flexibility. Most users will rarely need to use drbdsetup directly, if at all.
drbdmeta
Lets you create, dump, restore, and modify DRBD metadata structures. As with drbdsetup, most users will only rarely need to use drbdmeta directly.
1.3. Resources
In DRBD, resource is the collective term that refers to all aspects of a particular replicated data set. These include:
Resource name
This can be any arbitrary US-ASCII name, not containing white space, by which the resource is referred to.
Beginning with DRBD 9.2.0, there is a stricter naming convention for resources. DRBD 9.2.x accepts only alphanumeric, ., +, _, and - characters in resource names (regular expression: [0-9A-Za-z.+_-]*). If you depend on the old behavior, you can restore it by disabling strict name checking: echo 0 > /sys/module/drbd/parameters/strict_names.
Volumes
Any resource is a replication group consisting of one or more volumes that share a common replication stream. DRBD ensures write fidelity across all volumes in the resource. Volumes are numbered starting with 0, and there may be up to 65,535 volumes in one resource. A volume contains the replicated data set, and a set of metadata for DRBD internal use.
At the drbdadm level, a volume within a resource can be addressed by the resource name and volume number as resource/volume.
DRBD device
This is a virtual block device managed by DRBD. It has a device major number of 147, and its minor numbers are numbered from 0 onwards, as is customary. Each DRBD device corresponds to a volume in a resource. The associated block device is usually named /dev/drbdX, where X is the device minor number. udev will typically also create symlinks containing the resource name and volume number, as in /dev/drbd/by-res/resource/vol-nr.
Depending on how you installed DRBD, you might need to install the drbd-udev package on RPM based systems to install the DRBD udev rules. If your DRBD resources were created before the DRBD udev rules were installed, you will need to manually trigger the udev rules to generate the udev symlinks for DRBD resources, by using the udevadm trigger command.
Very early DRBD versions hijacked NBD’s device major number 43. This is long obsolete; 147 is the allocated DRBD device major.
Connection
A connection is a communication link between two hosts that share a replicated data set. With DRBD 9, each resource can be defined on multiple hosts; with the current versions this requires a fully meshed connection setup between these hosts (that is, each host connected to every other host for that resource).
At the drbdadm level, a connection is addressed by the resource and the connection name (the latter defaulting to the peer hostname), like resource:connection.
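The naming and addressing conventions in this section can be illustrated with a short Python sketch. It is not part of DRBD or its tools. The helper names (is_strict_name, volume_address, connection_address, by_res_symlink, full_mesh_connections) are hypothetical, and the only inputs taken from the text are the strict-name regular expression, the resource/volume and resource:connection notations, the /dev/drbd/by-res/ symlink layout, and the full-mesh requirement.

```python
import re

# Character set quoted in the note on strict resource names (DRBD 9.2.x).
STRICT_NAME = re.compile(r"[0-9A-Za-z.+_-]*")


def is_strict_name(name: str) -> bool:
    """Return True if the whole name matches the strict pattern."""
    return STRICT_NAME.fullmatch(name) is not None


def volume_address(resource: str, volume: int) -> str:
    """Address of a volume at the drbdadm level: resource/volume."""
    return f"{resource}/{volume}"


def connection_address(resource: str, connection: str) -> str:
    """Address of a connection at the drbdadm level: resource:connection."""
    return f"{resource}:{connection}"


def by_res_symlink(resource: str, volume: int) -> str:
    """Symlink path that udev typically creates for a volume."""
    return f"/dev/drbd/by-res/{resource}/{volume}"


def full_mesh_connections(hosts: int) -> int:
    """Number of connections when every host connects to every other host."""
    return hosts * (hosts - 1) // 2


print(is_strict_name("web-data_1.0"))    # True
print(is_strict_name("my resource"))     # False: contains white space
print(volume_address("r0", 0))           # r0/0
print(connection_address("r0", "bob"))   # r0:bob
print(full_mesh_connections(4))          # 6
```

Note that the quoted pattern ends in *, so taken literally it also matches an empty string; the sketch reproduces the pattern as written rather than guessing at additional checks.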
1.4. Resource Roles
In DRBD, every resource has a role, which may be Primary or Secondary.
The choice of terms here is not arbitrary. These roles were deliberately not named “Active” and “Passive” by DRBD’s creators. Primary versus Secondary refers to a concept related to availability of storage, whereas active versus passive refers to the availability of an application. It is usually the case in a high-availability environment that the primary node is also the active one, but this is by no means necessary.
-
A DRBD device in the primary role can be used unrestrictedly for read and write operations. It may be used for creating and mounting file systems, raw or direct I/O to the block device, and so on.
-
A DRBD device in the secondary role receives all updates from the peer node’s device, but otherwise disallows access completely. It cannot be used by applications, for either read or write access. The reason for disallowing even read-only access to the device is the necessity to maintain cache coherency, which would be impossible if a secondary resource were made accessible in any way.
The resource’s role can, of course, be changed: by manual intervention, by an automated algorithm in a cluster management application, or automatically. Changing the resource role from secondary to primary is referred to as promotion, whereas the reverse operation is termed demotion.
1.5. Hardware and Environment Requirements
DRBD’s hardware and environment requirements and limitations are mentioned below. DRBD can work with just a few KiBs of physical storage and memory, or it can scale up to work with several TiBs of storage and many MiBs of memory.
1.5.2. Required Memory
DRBD needs about 32 MiB of RAM per 1 TiB of storage[1]. For DRBD’s maximum amount of storage (1 PiB), you would therefore need 32 GiB of RAM for the DRBD bitmap alone, before accounting for the operating system, userspace, and buffer cache.
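As a rough sizing aid, the rule of thumb above can be written as a one-line calculation. This is only a sketch of the stated ratio (about 32 MiB of RAM per 1 TiB of storage); the function name bitmap_ram_mib is hypothetical, and real memory use depends on your configuration.

```python
MIB_PER_TIB = 32  # rule of thumb from the text: about 32 MiB of RAM per 1 TiB


def bitmap_ram_mib(storage_tib: float) -> float:
    """Approximate RAM (in MiB) needed for the DRBD bitmap."""
    return storage_tib * MIB_PER_TIB


print(bitmap_ram_mib(4))            # 128 (MiB for 4 TiB of storage)
print(bitmap_ram_mib(1024) / 1024)  # 32.0 (GiB for 1 PiB = 1024 TiB)
```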
1.5.3. CPU Requirements
DRBD 9 is tested to build for the following CPU architectures:
-
amd64
-
arm64
-
ppc64le
-
s390x
Recent versions of DRBD 9 are only tested to build on 64-bit CPU architectures. Building DRBD on 32-bit CPU architectures is unsupported and may or may not work.
1.5.4. Minimum Linux Kernel Version
The minimum Linux kernel version supported in DRBD 9.0 is 2.6.32. Starting with DRBD 9.1, the minimum Linux kernel version supported is 3.10.
1.5.5. Maximum Number of DRBD Volumes on a Node
Due to the 20-bit constraint on minor numbers, the maximum number of DRBD volumes that you can have on a node is 1048576 (2^20).
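The limit follows directly from the width of the minor number. A minimal sketch of the arithmetic, with hypothetical helper names:

```python
MINOR_BITS = 20                 # width of the minor number, as stated above
MAX_VOLUMES = 1 << MINOR_BITS   # 2**20 distinct minor numbers


def is_valid_minor(minor: int) -> bool:
    """Minor numbers start at 0, so the highest valid value is MAX_VOLUMES - 1."""
    return 0 <= minor < MAX_VOLUMES


print(MAX_VOLUMES)              # 1048576
print(is_valid_minor(1048575))  # True
print(is_valid_minor(1048576))  # False
```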
1.6. FIPS Compliance
This standard shall be used in designing and implementing cryptographic modules…
Since DRBD version 9.2.6, it is possible to encrypt DRBD traffic by using the TLS feature. However, DRBD itself does not contain cryptographic modules. DRBD uses cryptographic modules that are available in the ktls-utils package (used by the tlshd daemon), or that are referenced by the Linux kernel crypto API. In either case, the cryptographic modules that DRBD uses to encrypt traffic will be FIPS compliant, so long as you are using a FIPS mode enabled operating system.
If you have not enabled the TLS feature, then DRBD does not use any cryptographic modules.
In DRBD versions before 9.2.6, it was only possible to use encryption with DRBD if it was implemented in a different block layer, and not by DRBD itself. Linux Unified Key Setup (LUKS) is an example of such an implementation. Refer to the LINSTOR User’s Guide for details about using LINSTOR to layer LUKS below the DRBD layer.
If you are using DRBD outside of LINSTOR, it is possible to layer LUKS above the DRBD layer. However, this implementation is not recommended because DRBD would no longer be able to disklessly attach or auto-promote resources.
2. DRBD Features
This chapter discusses various useful DRBD features, and gives some background information about them. Some of these features will be important to most users, some will only be relevant in very specific deployment scenarios. Working with DRBD and Troubleshooting and Error Recovery contain instructions on how to enable and use these features in day-to-day operation.
2.1. Single-primary Mode
In single-primary mode, a resource is, at any given time, in the primary role on only one cluster member. Since it is guaranteed that only one cluster node manipulates the data at any moment, this mode can be used with any conventional file system (ext3, ext4, XFS, and so on).
Deploying DRBD in single-primary mode is the canonical approach for High-Availability (fail-over capable) clusters.
2.2. Dual-primary Mode
In dual-primary mode a resource can be in the primary role on two nodes at a time. Since concurrent access to the data is therefore possible, this mode usually requires the use of a shared cluster file system that uses a distributed lock manager. Examples include GFS and OCFS2.
Deploying DRBD in dual-primary mode is the preferred approach for load-balancing clusters which require concurrent data access from two nodes, for example, virtualization environments with a need for live-migration. This mode is disabled by default, and must be enabled explicitly in DRBD’s configuration file.
See Enabling dual-primary mode for information about enabling dual-primary mode for specific resources.
2.3. Replication Modes
DRBD supports three distinct replication modes, allowing three degrees of replication synchronicity.
Protocol A
Asynchronous replication protocol. Local write operations on the primary node are considered completed as soon as the local disk write has finished, and the replication packet has been placed in the local TCP send buffer. In case of forced fail-over, data loss may occur. The data on the standby node is consistent after fail-over; however, the most recent updates performed prior to the fail-over could be lost. Protocol A is most often used in long distance replication scenarios. When used in combination with DRBD Proxy it makes an effective disaster recovery solution. See Long-distance Replication through DRBD Proxy for more information.
Protocol B
Memory synchronous (semi-synchronous) replication protocol. Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has reached the peer node. Normally, no writes are lost in case of forced fail-over. However, in case of simultaneous power failure on both nodes and concurrent, irreversible destruction of the primary’s data store, the most recent writes completed on the primary may be lost.
Protocol C
Synchronous replication protocol. Local write operations on the primary node are considered completed only after both the local and the remote disk write(s) have been confirmed. As a result, loss of a single node is guaranteed not to lead to any data loss. Data loss is, of course, inevitable even with this replication protocol if all nodes (or their respective storage subsystems) are irreversibly destroyed at the same time.
By far, the most commonly used replication protocol in DRBD setups is protocol C.
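The difference between the three protocols comes down to which events must have happened before a write is reported as complete to the application. The following Python sketch models only that decision; it is not DRBD code, and the WriteState fields are hypothetical names for the events described above.

```python
from dataclasses import dataclass
from enum import Enum


class Protocol(Enum):
    A = "A"  # asynchronous
    B = "B"  # memory synchronous (semi-synchronous)
    C = "C"  # synchronous


@dataclass
class WriteState:
    local_disk_done: bool = False  # local disk write has finished
    in_send_buffer: bool = False   # packet placed in the local TCP send buffer
    peer_received: bool = False    # packet has reached the peer node
    peer_disk_done: bool = False   # peer has confirmed its disk write


def write_completed(protocol: Protocol, state: WriteState) -> bool:
    """Return True once the write may be reported as complete."""
    if protocol is Protocol.A:
        return state.local_disk_done and state.in_send_buffer
    if protocol is Protocol.B:
        return state.local_disk_done and state.peer_received
    return state.local_disk_done and state.peer_disk_done


state = WriteState(local_disk_done=True, in_send_buffer=True)
print(write_completed(Protocol.A, state))  # True
print(write_completed(Protocol.C, state))  # False: peer disk write not confirmed
```

The later the completion point, the smaller the window in which a fail-over can lose recent writes, at the cost of higher write latency.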
The choice of replication protocol influences two factors of your deployment: protection and latency. Throughput, by contrast, is largely independent of the replication protocol selected.
See Configuring your resource for an example resource configuration which demonstrates replication protocol configuration.
2.4. More than Two-way Redundancy
With DRBD 9 it’s possible to have the data stored simultaneously on more than two cluster nodes.
While this has been possible before through stacking, in DRBD 9 this is supported out-of-the-box for (currently) up to 16 nodes. (In practice, using three-, four- or perhaps five-way redundancy through DRBD will make other things the leading cause of downtime.)
The major difference to the stacking solution is that there’s less performance loss, because only one level of data replication is being used.
2.5. Automatic Promotion of Resources
Prior to DRBD 9, promoting a resource would be done with the drbdadm primary command. With DRBD 9, DRBD will automatically promote a resource to primary role when the auto-promote option is enabled, and one of its volumes is mounted or opened for writing. As soon as all volumes are unmounted or closed, the role of the resource changes back to secondary.
Automatic promotion will only succeed if the cluster state allows it (that is, if an explicit drbdadm primary command would succeed). Otherwise, mounting or opening the device fails as it did prior to DRBD 9.
2.6. Multiple Replication Transports
DRBD supports multiple network transports. As of now two transport implementations are available: TCP and RDMA. Each transport implementation comes as its own kernel module.
2.6.1. TCP Transport
The drbd_transport_tcp.ko transport implementation is included with the distribution files of DRBD itself. As the name implies, this transport implementation uses the TCP/IP protocol to move data between machines.
DRBD’s replication and synchronization framework socket layer supports multiple low-level transports:
TCP over IPv4
This is the canonical implementation, and DRBD’s default. It may be used on any system that has IPv4 enabled.
TCP over IPv6
When configured to use standard TCP sockets for replication and synchronization, DRBD can also use IPv6 as its network protocol. This is equivalent in semantics and performance to IPv4, albeit using a different addressing scheme.
SDP
SDP is an implementation of BSD-style sockets for RDMA capable transports such as InfiniBand. SDP was available as part of the OFED stack of most distributions but is now considered deprecated. SDP uses an IPv4-style addressing scheme. Employed over an InfiniBand interconnect, SDP provides a high-throughput, low-latency replication network to DRBD.
SuperSockets
SuperSockets replace the TCP/IP portions of the stack with a single, monolithic, highly efficient and RDMA capable socket implementation. DRBD can use this socket type for very low latency replication. SuperSockets must run on specific hardware which is currently available from a single vendor, Dolphin Interconnect Solutions.
2.6.2. RDMA Transport
Since DRBD version 9.2.0, the drbd_transport_rdma kernel module is available as open source code.
You can download the open source code from LINBIT’s tar archived DRBD releases page, or through LINBIT’s DRBD GitHub repository.
Alternatively, if you are a LINBIT customer, the drbd_transport_rdma.ko kernel module is available in LINBIT’s customer software repositories.
This transport uses the verbs/RDMA API to move data over InfiniBand HCAs, iWARP capable NICs or RoCE capable NICs. In contrast to the BSD sockets API (used by TCP/IP) the verbs/RDMA API allows data movement with very little CPU involvement.
At high transfer rates, the CPU load or memory bandwidth of the TCP transport might become the limiting factor. With appropriate hardware, you can probably achieve higher transfer rates by using the RDMA transport.
A transport implementation can be configured for each connection of a resource. See Configuring transport implementations for more details.
2.7. Multiple Paths
DRBD allows configuring multiple paths per connection. The TCP transport uses only one path at a time for a connection, unless you have configured the TCP load balancing feature. The RDMA transport is capable of balancing the network traffic over multiple paths of a single connection. See Configuring multiple paths for more details.
2.8. Efficient Synchronization
(Re-)synchronization is distinct from device replication. While replication occurs on any write event to a resource in the primary role, synchronization is decoupled from incoming writes. Rather, it affects the device as a whole.
Synchronization is necessary if the replication link has been interrupted for any reason, be it due to failure of the primary node, failure of the secondary node, or interruption of the replication link. Synchronization is efficient in the sense that DRBD does not synchronize modified blocks in the order they were originally written, but in linear order, which has the following consequences:
-
Synchronization is fast, since blocks in which several successive write operations occurred are only synchronized once.
-
Synchronization is also associated with few disk seeks, as blocks are synchronized according to the natural on-disk block layout.
-
During synchronization, the data set on the standby node is partly obsolete and partly already updated. This state of data is called inconsistent.
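The first two consequences listed above can be illustrated with a toy model. The sketch below is not DRBD’s implementation: it assumes that out-of-sync blocks are tracked as a set of block numbers (a stand-in for the bitmap), so repeated writes to one block collapse into a single entry, and resynchronization walks the set in ascending block order. The function names are hypothetical.

```python
def record_writes(dirty: set, written_blocks) -> None:
    """Mark each written block as out of sync; repeated writes collapse."""
    for block in written_blocks:
        dirty.add(block)


def resync_order(dirty: set) -> list:
    """Blocks are synchronized in linear (on-disk) order, not write order."""
    return sorted(dirty)


dirty = set()
# Writes made while the replication link was down, in the order they occurred.
record_writes(dirty, [7, 2, 7, 7, 5, 2])
print(resync_order(dirty))  # [2, 5, 7]: six writes, three blocks to transfer
```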
The service continues to run uninterrupted on the active node, while background synchronization is in progress.
A node with inconsistent data generally cannot be put into operation, therefore it is desirable to keep the time period during which a node is inconsistent as short as possible. DRBD does, however, include an LVM integration facility that automates the creation of LVM snapshots immediately before synchronization. This ensures that a consistent copy of the data is always available on the peer, even while synchronization is running. See Using Automated LVM Snapshots During DRBD Synchronization for details on using this facility.
2.8.1. Variable-rate Synchronization
In variable-rate synchronization (the default since 8.4), DRBD detects the available bandwidth on the synchronization network, compares it to incoming foreground application I/O, and selects an appropriate synchronization rate based on a fully automatic control loop.
See Variable Synchronization Rate Configuration for configuration suggestions with regard to variable-rate synchronization.
2.8.2. Fixed-rate Synchronization
In fixed-rate synchronization, the amount of data shipped to the synchronizing peer per second (the synchronization rate) has a configurable, static upper limit. Based on this limit, you can estimate the expected sync time with the following simple formula:
tsync = D / R
tsync is the expected sync time. D is the amount of data to be synchronized, which you are unlikely to have any influence over (this is the amount of data that was modified by your application while the replication link was broken). R is the rate of synchronization, which is configurable and bounded by the throughput limitations of the replication network and I/O subsystem.
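As a quick worked example, the relationship between tsync, D, and R (expected sync time equals the amount of data divided by the synchronization rate) can be evaluated in a few lines of Python. The function name and the figures used (90 GiB of changed data, a rate of 30 MiB/s) are illustrative assumptions only.

```python
def estimated_sync_seconds(data_bytes: float, rate_bytes_per_s: float) -> float:
    """tsync = D / R, with D in bytes and R in bytes per second."""
    if rate_bytes_per_s <= 0:
        raise ValueError("synchronization rate must be positive")
    return data_bytes / rate_bytes_per_s


GIB = 2**30
MIB = 2**20

seconds = estimated_sync_seconds(90 * GIB, 30 * MIB)
print(seconds)       # 3072.0
print(seconds / 60)  # 51.2 (minutes)
```

Because R is an upper limit, the result is best read as a lower bound on the actual synchronization time.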
See Configuring the Rate of Synchronization for configuration suggestions with regard to fixed-rate synchronization.
2.8.3. Checksum-based Synchronization
The efficiency of DRBD’s synchronization algorithm may be further enhanced by using data digests, also known as checksums. With checksum-based synchronization, rather than performing a brute-force overwrite of blocks marked out of sync, DRBD reads blocks before synchronizing them and computes a hash of the contents currently found on disk. It then compares this hash with one computed from the same sector on the peer, and omits rewriting the block if the hashes match. This can dramatically cut down synchronization times in situations where a file system rewrites a sector with identical contents while DRBD is in disconnected mode.
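The idea can be sketched in Python as follows. This is a simplified model, not DRBD’s implementation: blocks are plain byte strings held in lists, SHA-1 from Python’s hashlib stands in for whichever digest algorithm is configured, and the function names are hypothetical.

```python
import hashlib


def digest(block: bytes) -> bytes:
    """Hash of a block's current contents."""
    return hashlib.sha1(block).digest()


def blocks_to_transfer(source, target, out_of_sync):
    """Of the blocks marked out of sync, return those whose contents differ.

    Blocks whose digests match on both sides are skipped instead of being
    overwritten.
    """
    return [
        index
        for index in sorted(out_of_sync)
        if digest(source[index]) != digest(target[index])
    ]


source = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
target = [b"aaaa", b"XXXX", b"cccc", b"YYYY"]
# Block 2 was rewritten with identical contents while disconnected.
print(blocks_to_transfer(source, target, {1, 2, 3}))  # [1, 3]
```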
See Configuring Checksum-based Synchronization for configuration suggestions with regard to synchronization.
2.9. Suspended Replication
If properly configured, DRBD can detect if the replication network is congested, and suspend replication in this case. In this mode, the primary node “pulls ahead” of the secondary — temporarily going out of sync, but still leaving a consistent copy on the secondary. When more bandwidth becomes available, replication automatically resumes and a background synchronization takes place.
Suspended replication is typically enabled over links with variable bandwidth, such as wide area replication over shared connections between data centers or cloud instances.
See Configuring Congestion Policies and Suspended Replication for details on congestion policies and suspended replication.
2.10. Online Device Verification
Online device verification enables users to do a block-by-block data integrity check between nodes in a very efficient manner.
Note that efficient refers to efficient use of network bandwidth here, and to the fact that verification does not break redundancy in any way. Online verification is still a resource-intensive operation, with a noticeable impact on CPU utilization and load average.
It works by one node (the verification source) sequentially calculating a cryptographic digest of every block stored on the lower-level storage device of a particular resource. DRBD then transmits that digest to the peer node(s) (the verification target(s)), where it is checked against a digest of the local copy of the affected block. If the digests do not match, the block is marked out-of-sync and may later be synchronized. Because DRBD transmits just the digests, not the full blocks, online verification uses network bandwidth very efficiently.
The process is termed online verification because it does not require that the DRBD resource being verified is unused at the time of verification. Therefore, though it does carry a slight performance penalty while it is running, online verification does not cause service interruption or system down time — neither during the verification run nor during subsequent synchronization.
It is a common use case to have online verification managed by the local cron daemon, running it, for example, once a week or once a month. See Using Online Device Verification for information about how to enable, invoke, and automate online verification.
2.11. Replication Traffic Integrity Checking
DRBD optionally performs end-to-end message integrity checking using cryptographic message digest algorithms such as MD5, SHA-1, or CRC-32C.
These message digest algorithms are not provided by DRBD, but by the Linux kernel crypto API; DRBD merely uses them. Therefore, DRBD is capable of using any message digest algorithm available in a particular system’s kernel configuration.
With this feature enabled, DRBD generates a message digest of every data block it replicates to the peer, which the peer then uses to verify the integrity of the replication packet. If the replicated block can not be verified against the digest, the connection is dropped and immediately re-established; because of the bitmap the typical result is a retransmission. Therefore, DRBD replication is protected against several error sources, all of which, if unchecked, would potentially lead to data corruption during the replication process:
-
Bitwise errors (“bit flips”) occurring on data in transit between main memory and the network interface on the sending node (which goes undetected by TCP checksumming if it is offloaded to the network card, as is common in recent implementations);
-
Bit flips occurring on data in transit from the network interface to main memory on the receiving node (the same considerations apply for TCP checksum offloading);
-
Any form of corruption due to race conditions or bugs in network interface firmware or drivers;
-
Bit flips or random corruption injected by some reassembling network component between nodes (if not using direct, back-to-back connections).
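The mechanism described above (the sender attaches a digest, the receiver recomputes and compares it) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DRBD’s wire format: SHA-1 from hashlib stands in for the configured algorithm, and make_packet, packet_is_intact, and flip_bit are hypothetical helpers.

```python
import hashlib


def make_packet(block: bytes):
    """Sender side: pair the data block with its message digest."""
    return block, hashlib.sha1(block).digest()


def packet_is_intact(payload: bytes, sent_digest: bytes) -> bool:
    """Receiver side: recompute the digest and compare it with the one sent."""
    return hashlib.sha1(payload).digest() == sent_digest


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single bit flip somewhere between sender and receiver."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


payload, sent_digest = make_packet(b"replicated block")
print(packet_is_intact(payload, sent_digest))               # True
print(packet_is_intact(flip_bit(payload, 3), sent_digest))  # False
```

In the failing case, DRBD would drop and re-establish the connection, which typically results in the block being retransmitted.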
See Configuring Replication Traffic Integrity Checking for information about how to enable replication traffic integrity checking.
2.12. Split Brain Notification and Automatic Recovery
Split brain is a situation where, due to temporary failure of all network links between cluster nodes, and possibly due to intervention by cluster management software or human error, both nodes switched to the Primary role while disconnected. This is a potentially harmful state, as it implies that modifications to the data might have been made on either node, without having been replicated to the peer. Therefore, it is likely in this situation that two diverging sets of data have been created, which cannot be trivially merged.
DRBD split brain is distinct from cluster split brain, which is the loss of all connectivity between hosts managed by a distributed cluster management application such as Pacemaker. To avoid confusion, this guide uses the following convention:
-
Split brain refers to DRBD split brain as described in the paragraph above.
-
Loss of all cluster connectivity is referred to as a cluster partition, an alternative term for cluster split brain.
DRBD allows for automatic operator notification (by email or other means) when it detects split brain. See Split Brain Notification for details on how to configure this feature.
While the recommended course of action in this scenario is to manually resolve the split brain and then eliminate its root cause, it may be desirable, in some cases, to automate the process. DRBD has several resolution algorithms available for doing so:
-
Discarding modifications made on the younger primary. In this mode, when the network connection is re-established and split brain is discovered, DRBD will discard modifications made, in the meantime, on the node which switched to the primary role last.
-
Discarding modifications made on the older primary. In this mode, DRBD will discard modifications made, in the meantime, on the node which switched to the primary role first.
-
Discarding modifications on the primary with fewer changes. In this mode, DRBD will check which of the two nodes has recorded fewer modifications, and will then discard all modifications made on that host.
-
Graceful recovery from split brain if one host has had no intermediate changes. In this mode, if one of the hosts has made no modifications at all during split brain, DRBD will simply recover gracefully and declare the split brain resolved. Note that this is a fairly unlikely scenario. Even if both hosts only mounted the file system on the DRBD block device (even read-only), the device contents typically would be modified (for example, by file system journal replay), ruling out the possibility of automatic recovery.
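The four resolution algorithms above differ only in how they choose the node whose data is replaced by its peer’s. The following Python sketch models that choice for two nodes. It is not DRBD code: the Node fields are hypothetical, and the policy labels are chosen here for illustration, so refer to Automatic Split Brain Recovery Policies for the actual configuration options.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    became_primary_at: float  # when the node switched to the primary role
    changes: int              # modifications recorded while disconnected


def sync_target(policy: str, a: Node, b: Node):
    """Return the node that is overwritten by its peer, or None if the
    policy cannot decide and manual resolution is needed."""
    if policy == "discard-younger-primary":
        return max(a, b, key=lambda n: n.became_primary_at)
    if policy == "discard-older-primary":
        return min(a, b, key=lambda n: n.became_primary_at)
    if policy == "discard-least-changes":
        return min(a, b, key=lambda n: n.changes)
    if policy == "discard-zero-changes":
        unchanged = [n for n in (a, b) if n.changes == 0]
        return unchanged[0] if unchanged else None
    raise ValueError(f"unknown policy: {policy}")


alice = Node("alice", became_primary_at=100.0, changes=5)
bob = Node("bob", became_primary_at=200.0, changes=0)
print(sync_target("discard-younger-primary", alice, bob).name)  # bob
print(sync_target("discard-older-primary", alice, bob).name)    # alice
print(sync_target("discard-zero-changes", alice, bob).name)     # bob
```

Every policy except the last one discards real modifications on the chosen node, which is why the text recommends manual resolution as the default.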
Whether or not automatic split brain recovery is acceptable depends largely on the individual application. Consider the example of DRBD hosting a database. Discarding modifications from the host with fewer changes may be fine for a web application click-through database. By contrast, it may be totally unacceptable to automatically discard any modifications made to a financial database, requiring manual recovery in any split brain event. Consider your application’s requirements carefully before enabling automatic split brain recovery.
Refer to Automatic Split Brain Recovery Policies for details on configuring DRBD’s automatic split brain recovery policies.
2.13. Support for Disk Flushes
When local block devices such as hard drives or RAID logical disks have write caching enabled, writes to these devices are considered completed as soon as they have reached the volatile cache. Controller manufacturers typically refer to this as write-back mode, the opposite being write-through. If a power outage occurs on a controller in write-back mode, the last writes are never committed to the disk, potentially causing data loss.
To counteract this, DRBD makes use of disk flushes. A disk flush is a write operation that completes only when the associated data has been committed to stable (non-volatile) storage — that is to say, it has effectively been written to disk, rather than to the cache. DRBD uses disk flushes for write operations both to its replicated data set and to its metadata. In effect, DRBD circumvents the write cache in situations it deems necessary, as in activity log updates or enforcement of implicit write-after-write dependencies. This means additional reliability even in the face of power failure.
It is important to understand that DRBD can use disk flushes only when layered on top of backing devices that support them. Most reasonably recent kernels support disk flushes for most SCSI and SATA devices. Linux software RAID (md) supports disk flushes for RAID-1 provided that all component devices support them too. The same is true for device-mapper devices (LVM2, dm-raid, multipath).
Controllers with battery-backed write cache (BBWC) use a battery to back up their volatile storage. On such devices, when power is restored after an outage, the controller flushes all pending writes out to disk from the battery-backed cache, ensuring that all writes committed to the volatile cache are actually transferred to stable storage. When running DRBD on top of such devices, it may be acceptable to disable disk flushes, thereby improving DRBD’s write performance. See Disabling Backing Device Flushes for details.
2.14. Trim and Discard Support
Trim and Discard are two names for the same feature: a request to a storage system, telling it that some data range is not being used anymore[2] and can be erased internally.
This call originates in flash-based storage (SSDs, FusionIO cards, and so on), which cannot easily rewrite a sector but instead has to erase it and write the (new) data again (incurring some latency cost). For more details, see, for example, the Wikipedia page.
Since 8.4.3 DRBD includes support for Trim/Discard. You don’t need to configure or enable anything; if DRBD detects that the local (underlying) storage system allows using these commands, it will transparently enable them and pass such requests through.
The effect is that, for example, a recent-enough mkfs.ext4 on a multi-TB volume can shorten the initial sync time to a few seconds to minutes, just by telling DRBD (which will relay that information to all connected nodes) that most or all of the storage is now to be seen as invalidated.
Nodes that connect to that resource later on will not have seen the Trim/Discard requests, and will therefore start a full resync. Depending on kernel version and file system, a call to fstrim might give the wanted result, though.
Even if you don’t have storage with Trim/Discard support, some virtual block devices will provide you with the same feature, for example Thin LVM.
2.15. Disk Error Handling Strategies
If a hard disk that is used as a backing block device for DRBD on one of the nodes fails, DRBD may either pass on the I/O error to the upper layer (usually the file system) or it can mask I/O errors from upper layers.
If DRBD is configured to pass on I/O errors, any such errors occurring on the lower-level device are transparently passed to upper I/O layers. Therefore, it is left to upper layers to deal with such errors (this may result in a file system being remounted read-only, for example). This strategy does not ensure service continuity, and is therefore not recommended for most users.
If DRBD is configured to detach on lower-level I/O error, DRBD will do so, automatically, upon occurrence of the first lower-level I/O error. The I/O error is masked from upper layers while DRBD transparently fetches the affected block from a peer node, over the network. From then onwards, DRBD is said to operate in diskless mode, and carries out all subsequent I/O operations, read and write, on the peer node(s) only. Performance in this mode will be reduced, but the service continues without interruption, and can be moved to the peer node in a deliberate fashion at a convenient time.
See Configuring I/O Error Handling Strategies for information about configuring I/O error handling strategies for DRBD.
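As an illustrative sketch, the detach strategy is selected with the on-io-error option in the disk section of a resource. The resource name r0 is a hypothetical placeholder; the option values shown are the ones documented in the drbd.conf man page.

```
resource r0 {
  disk {
    on-io-error detach;   # mask the error and continue in diskless mode
    # on-io-error pass_on;  # alternative: report the error to upper layers
  }
  # remaining resource configuration omitted
}
```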
2.16. Strategies for Handling Outdated Data
DRBD distinguishes between inconsistent and outdated data. Inconsistent data is data that cannot be expected to be accessible and useful in any manner. The prime example for this is data on a node that is currently the target of an ongoing synchronization. Data on such a node is part obsolete, part up to date, and impossible to identify as either. Therefore, for example, if the device holds a file system (as is commonly the case), that file system would not be expected to mount or even pass an automatic file system check.
Outdated data, by contrast, is data on a secondary node that is consistent, but no longer in sync with the primary node. This would occur in any interruption of the replication link, whether temporary or permanent. Data on an outdated, disconnected secondary node is expected to be clean, but it reflects a state of the peer node some time past. To avoid services using outdated data, DRBD disallows promoting a resource that is in the outdated state.
DRBD has interfaces that allow an external application to outdate a secondary node as soon as a network interruption occurs. DRBD will then refuse to switch the node to the primary role, preventing applications from using the outdated data. A complete implementation of this functionality exists for the Pacemaker cluster management framework (where it uses a communication channel separate from the DRBD replication link). However, the interfaces are generic and may be easily used by any other cluster management application.
Whenever an outdated resource has its replication link re-established, its outdated flag is automatically cleared. A background synchronization then follows.
2.17. Three-way Replication Using Stacking
Available in DRBD version 8.3.0 and above; deprecated in DRBD version 9.x, as more nodes can be implemented on a single level. See Defining network connections for details.
When using three-way replication, DRBD adds a third node to an existing 2-node cluster and replicates data to that node, where it can be used for backup and disaster recovery purposes. This type of configuration generally involves Long-distance Replication through DRBD Proxy.
Three-way replication works by adding another, stacked DRBD resource on top of the existing resource holding your production data, as seen in this illustration:
The stacked resource is replicated using asynchronous replication (DRBD protocol A), whereas the production data would usually make use of synchronous replication (DRBD protocol C).
Three-way replication can be used permanently, where the third node is continuously updated with data from the production cluster. Alternatively, it may also be employed on demand, where the production cluster is normally disconnected from the backup site, and site-to-site synchronization is performed on a regular basis, for example by running a nightly cron job.
2.18. Long-distance Replication through DRBD Proxy
DRBD’s protocol A is asynchronous, but the writing application will block as soon as the socket output buffer is full (see the sndbuf-size option in the man page of drbd.conf). In that event, the writing application has to wait until some of the written data drains through a possibly low-bandwidth network link.
The average write bandwidth is limited by available bandwidth of the network link. Write bursts can only be handled gracefully if they fit into the limited socket output buffer.
You can mitigate this by using DRBD Proxy’s buffering mechanism. DRBD Proxy will place changed data from the DRBD device on the primary node into its buffers. DRBD Proxy’s buffer size is freely configurable, limited only by the address space size and available physical RAM.
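To get a feel for how much buffer a write burst needs, the following back-of-the-envelope sketch estimates how long a buffer lasts when the application writes faster than the link can drain. The function name and the sample figures are illustrative assumptions, not DRBD or DRBD Proxy settings, and the calculation ignores compression and protocol overhead.

```shell
#!/bin/sh
# Seconds until a buffer of the given size fills during a sustained write burst.
# Arguments: buffer size (MiB), application write rate (MiB/s), link rate (MiB/s).
seconds_until_full() {
    buffer_mib=$1
    write_rate=$2
    link_rate=$3
    if [ "$write_rate" -le "$link_rate" ]; then
        echo "never"   # the link keeps up, so the buffer does not fill
        return 0
    fi
    # The buffer fills at the difference between the two rates (integer division).
    echo $(( buffer_mib / (write_rate - link_rate) ))
}

# Hypothetical example: 1 GiB buffer, 60 MiB/s burst, 10 MiB/s link.
seconds_until_full 1024 60 10
```

A burst that lasts longer than the reported number of seconds would make the writing application block, which is the situation a larger buffer is meant to avoid.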
Optionally DRBD Proxy can be configured to compress and decompress the data it forwards. Compression and decompression of DRBD’s data packets might slightly increase latency. However, when the bandwidth of the network link is the limiting factor, the gain in shortening transmit time outweighs the added latency of compression and decompression.
Compression and decompression were implemented with multicore SMP systems in mind, and can use multiple CPU cores.
Because most block I/O data compresses very well, the effective bandwidth increases, which justifies the use of DRBD Proxy even with DRBD protocols B and C.
See Using DRBD Proxy for information about configuring DRBD Proxy.
DRBD Proxy is one of the few parts of the DRBD product family that is not published under an open source license. Please contact [email protected] or [email protected] for an evaluation license.
2.19. Truck-based Replication
Truck-based replication, also known as disk shipping, is a means of preseeding a remote site with data to be replicated, by physically shipping storage media to the remote site. This is particularly suited for situations where
-
the total amount of data to be replicated is fairly large (more than a few hundred gigabytes);
-
the expected rate of change of the data to be replicated is less than enormous;
-
the available network bandwidth between sites is limited.
In such situations, without truck-based replication, DRBD would require a very long initial device synchronization (on the order of weeks, months, or years). Truck-based replication allows shipping a data seed to the remote site, and so drastically reduces the initial synchronization time. See Using truck based replication for details on this use case.
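To judge whether disk shipping is worth the effort, it can help to estimate how long an initial synchronization over the network would take. The sketch below is a rough calculation only: the function name and the sample values are assumptions for illustration, and it ignores protocol overhead and any competing traffic on the link.

```shell
#!/bin/sh
# Estimate how many whole days an initial sync would take over a network link.
# Arguments: data size in gigabytes (10^9 bytes), link speed in Mbit/s.
sync_days() {
    size_gb=$1
    link_mbit=$2
    # 1 GB = 8000 Mbit, so seconds = size_gb * 8000 / link_mbit
    seconds=$(( size_gb * 8 * 1000 / link_mbit ))
    echo $(( seconds / 86400 ))
}

# Hypothetical example: 10 TB of data over a 10 Mbit/s link.
sync_days 10000 10   # prints 92
```

With these sample figures the link would be saturated for about three months, which is the kind of situation where shipping a preseeded disk pays off.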
2.20. Floating Peers
This feature is available in DRBD versions 8.3.2 and above.
A somewhat special use case for DRBD is the floating peers configuration. In floating peer setups, DRBD peers are not tied to specific named hosts (as in conventional configurations), but instead have the ability to float between several hosts. In such a configuration, DRBD identifies peers by IP address, rather than by host name.
For more information about managing floating peer configurations, see Configuring DRBD to Replicate Between Two SAN-backed Pacemaker Clusters.
2.21. Data Rebalancing (Horizontal Storage Scaling)
If your company’s policy says that 3-way redundancy is needed, you need at least 3 servers for your setup.
Now, as your storage demands grow, you will encounter the need for additional servers. Rather than having to buy 3 more servers at the same time, you can rebalance your data across a single additional node.
In the figure above you can see the before and after states: from 3 nodes with three 25TiB volumes each (for a net 75TiB), to 4 nodes, with net 100TiB.
DRBD 9 makes it possible to do an online, live migration of the data; please see Data Rebalancing for the exact steps needed.
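The capacity figures in this example follow from simple arithmetic: net capacity is the raw capacity of all nodes divided by the number of replicas. The sketch below reproduces the numbers from the text, assuming each node contributes 75TiB (three 25TiB volumes) and every volume is kept three times; the function name is a hypothetical helper.

```shell
#!/bin/sh
# Net usable capacity in TiB for a cluster with uniform nodes.
# Arguments: node count, capacity per node (TiB), number of replicas.
net_capacity_tib() {
    nodes=$1
    per_node_tib=$2
    replicas=$3
    echo $(( nodes * per_node_tib / replicas ))
}

net_capacity_tib 3 75 3   # prints 75  (before adding a node)
net_capacity_tib 4 75 3   # prints 100 (after adding a fourth node)
```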
2.22. DRBD Client
With the multiple-peer feature of DRBD, several interesting use cases have been added, for example the DRBD client.
The basic idea is that the DRBD back end can consist of three, four, or more nodes (depending on the policy of required redundancy); but, as DRBD 9 can connect more nodes than that, DRBD then works as a storage access protocol in addition to storage replication.
All write requests executed on a primary DRBD client get shipped to all nodes equipped with storage. Read requests are only shipped to one of the server nodes. The DRBD client will evenly distribute the read requests among all available server nodes.
See Permanently Diskless Nodes for more information.
2.23. DRBD Quorum
DRBD quorum is a feature that can help you avoid split-brain situations and data divergence in high-availability clusters. By using DRBD quorum, you do not have to resort to using fencing or STONITH solutions, although you can also use these if you want to. DRBD quorum requires at least three nodes but a third node which acts as an arbitrator can be diskless. An arbitrator node does not need to have the same hardware specifications as diskful storage nodes in your cluster. For example, a low-powered single-board computer such as a Raspberry Pi might even be sufficient.
The functional concept behind DRBD quorum is that a cluster node can only modify the DRBD replicated data set if the number of DRBD-running nodes for the data set that the node can communicate with, including the node itself, meets the requirement that you specify when you enable the quorum option. For most configurations, this will be a majority number of nodes. A majority number is more than half of the total number of DRBD-running nodes for the data set in the cluster. By only allowing data writes on a node that has network access to more than half the nodes in a given partition, including the node itself, the DRBD quorum feature helps you avoid a situation that would create a diverging data set.
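The majority rule can be illustrated with a few lines of shell arithmetic. This is a simplified model with hypothetical function names: it only counts nodes, and does not model how DRBD itself tracks outdated or gracefully disconnected peers.

```shell
#!/bin/sh
# Smallest number of reachable nodes (including this one) that forms a majority.
quorum_majority() {
    total=$1
    echo $(( total / 2 + 1 ))
}

# Succeeds if the reachable node count meets the majority requirement.
# Arguments: reachable nodes (including this one), total nodes.
has_quorum() {
    [ "$1" -ge "$(quorum_majority "$2")" ]
}

quorum_majority 3   # prints 2
has_quorum 2 3 && echo "3-node cluster, 2 reachable: quorum"
has_quorum 1 2 || echo "2-node cluster, 1 reachable: no quorum"
```

The last line shows why a 2-node cluster cannot keep quorum after losing its replication link: each partition sees only one of two nodes.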
It is not, however, the case that you always need two or more running nodes in a 3-node cluster for a DRBD primary node to be able to write to the data set. There is an exception: if you disconnect all secondary nodes gracefully, DRBD will mark their data as outdated as they leave the cluster. In this way, a single remaining DRBD primary node would “know” that it is safe to continue to write data. This situation could arise, for example, while performing system maintenance on nodes in your cluster, and it lets you maintain nodes without causing downtime for your applications and services.
Using DRBD quorum is compatible with running a Pacemaker cluster. Pacemaker gets informed about quorum or loss-of-quorum through the master score of the DRBD resource.
There are different options and behaviors that you can configure related to the DRBD quorum feature, such as how you define quorum in your cluster and what actions DRBD might take when a node loses quorum. Refer to the Configuring Quorum section for information about this.
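As a minimal sketch, quorum is enabled in the options section of a resource. The resource name r0 is a hypothetical placeholder, and the values shown are the ones documented in the drbd.conf man page; which on-no-quorum behavior suits you depends on your applications.

```
resource r0 {
  options {
    quorum majority;         # require more than half of the nodes
    on-no-quorum io-error;   # alternative: suspend-io
  }
  # remaining resource configuration omitted
}
```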
2.23.1. Quorum Tiebreaker
The quorum tiebreaker feature is available in DRBD versions 9.0.18 and above.
The fundamental problem with 2-node clusters is that the moment they lose connectivity, there are two partitions and neither partition has quorum. This results in the cluster halting the service. You can mitigate this problem by adding a third, diskless node to the cluster which will then act as a quorum tiebreaker.
Refer to Using a Diskless Node as a Tiebreaker for more information.
2.24. Resync-after
DRBD runs all its necessary resync operations in parallel so that nodes are reintegrated with up-to-date data as soon as possible. This works well when there is one DRBD resource per backing disk.
However, when DRBD resources share a physical disk (or when a single resource spans multiple volumes), resyncing these resources (or volumes) in parallel results in a nonlinear access pattern. Hard disks perform much better with a linear access pattern. For such cases you can serialize resyncs using the resync-after keyword within a disk section of a DRBD resource configuration file.
See here for an example.
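A minimal sketch of such a dependency, assuming two hypothetical resources r0 and r1 that share one physical disk, might look like the following. The res-name/volume argument form follows the drbd.conf man page; adapt the names and volume numbers to your own configuration.

```
resource r1 {
  disk {
    resync-after r0/0;   # start resyncing only after volume 0 of r0 is done
  }
  # remaining resource configuration omitted
}
```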
2.25. Failover Clusters
In many scenarios it is useful to combine DRBD with a failover cluster resource manager. DRBD can integrate with a cluster resource manager (CRM) such as DRBD Reactor and its promoter plug-in, or Pacemaker, to create failover clusters.
DRBD Reactor is an open source tool that monitors DRBD events and reacts to them. Its promoter plug-in manages services using systemd unit files or OCF resource agents. Since DRBD Reactor solely relies on DRBD’s cluster communication, no configuration for its own communication is needed.
DRBD Reactor requires that quorum is enabled on the DRBD resources it is monitoring, so a failover cluster must have a minimum of three nodes. A limitation is that it supports ordering of services only for collocated services. One of its advantages is that it makes fully automatic recovery of clusters possible after a temporary network failure. This, together with its simplicity, makes it the recommended failover cluster manager. Furthermore, DRBD Reactor is perfectly suitable for deployments on clouds as it needs no STONITH or redundant networks in deployments with three or more nodes (for quorum).
Pacemaker is the longest available open source cluster resource manager for high-availability clusters. It requires its own communication layer (Corosync) and it requires STONITH to deal with various scenarios. STONITH might require dedicated hardware and it can increase the impact radius of a service failure. Pacemaker probably has the most flexible system to express resource location and ordering constraints. However, with this flexibility, setups can become complex.
Finally, there are also proprietary solutions for failover clusters that work with DRBD, such as SIOS LifeKeeper for Linux, HPE Serviceguard for Linux, and Veritas Cluster Server.
2.26. DRBD Integration for VCS
Veritas Cluster Server (or Veritas InfoScale Availability) is a commercial alternative to the Pacemaker open source software. If you need to integrate DRBD resources into a VCS setup, see the README in drbd-utils/scripts/VCS on GitHub.
Building and Installing the DRBD Software
3. Installing Prebuilt DRBD Binary Packages
3.1. LINBIT Supplied Packages
LINBIT, the DRBD project’s sponsor company, provides binary packages to its commercial support customers.
These packages are available through repositories and package manager commands (for example, apt, dnf), and when reasonable through LINBIT’s Docker registry. Packages and images from these sources are considered “official” builds.
These builds are available for the following distributions:
-
Red Hat Enterprise Linux (RHEL), versions 7, 8 and 9
-
SUSE Linux Enterprise Server (SLES), versions 12 and 15
-
Debian GNU/Linux, 9 (stretch), 10 (buster), and 11 (bullseye)
-
Ubuntu Server Edition LTS 18.04 (Bionic Beaver), LTS 20.04 (Focal Fossa), and LTS 22.04 (Jammy Jellyfish)
-
Oracle Linux (OL), versions 8 and 9
Refer to the LINBIT Kernel Module Signing for Secure Boot section for information about which specific DRBD kernel modules have signed packages for which distributions.
Packages for some other distributions are built as well, but don’t receive as much testing.
LINBIT releases binary builds in parallel with any new DRBD source release.
Package installation on RPM-based systems (SLES, RHEL, AlmaLinux) is done by simply using dnf install (for new installations) or dnf update (for upgrades).
For DEB-based systems (Debian GNU/Linux, Ubuntu), the drbd-utils and drbd-module-`uname -r` packages are installed by using apt install.
3.1.1. Using a LINBIT Helper Script to Register Nodes and Configure Package Repositories
If you are a LINBIT customer, you can install DRBD and dependencies that you may need from LINBIT’s customer repositories. To access those repositories you will need to have been set up in LINBIT’s system and have access to the LINBIT Customer Portal. If you have not been set up in LINBIT’s system, or if you want an evaluation account, you can contact a sales team member: [email protected].
Using the LINBIT Customer Portal to Register Nodes
Once you have access to the LINBIT Customer Portal, you can register your cluster nodes and configure repository access by using LINBIT’s Python helper script. See the Register Nodes section of the Customer Portal for details about this script.
Downloading and Running the LINBIT Manage Nodes Helper Script
To download and run the LINBIT helper script to register your nodes and configure LINBIT repository access, enter the following commands on all nodes, one node at a time:
# curl -O https://my.linbit.com/linbit-manage-node.py
# chmod +x ./linbit-manage-node.py
# ./linbit-manage-node.py
The script must be run as superuser.
If the error message no python interpreter found :-( is displayed when running linbit-manage-node.py, enter the command dnf -y install python3 (RPM-based distributions) or apt -y install python3 (DEB-based distributions) to install Python 3.
The script will prompt you to enter your LINBIT Customer Portal username and password. After validating your credentials, the script will list clusters and nodes (if you have any already registered) that are associated with your account.
Joining Nodes to a Cluster
Select the cluster that you want to register the current node with. If you want the node to be the first node in a new cluster, select the “new cluster” option.
Saving the Registration and Repository Configurations to Files
To save the registration information on your node, confirm the writing of registration data to a JSON file, when the helper script prompts you to.
Writing registration data:
--> Write to file (/var/lib/drbd-support/registration.json)? [y/N]
To save the LINBIT repository configuration to a file on your node, confirm the writing of a linbit.repo file, when the helper script prompts you to.
Enabling Access to LINBIT Repositories
After registering a node by using the LINBIT manage node helper script and joining the node to a cluster, the script will show you a menu of LINBIT repositories.
To install DRBD, its dependencies, and related packages, enable the drbd-9 repository.
The drbd-9 repository includes the latest DRBD 9 version. It also includes other LINBIT software packages, including LINSTOR®, DRBD Reactor, LINSTOR GUI, OCF resource agents, and others.
Installing LINBIT’s Public Key and Verifying LINBIT Repositories
After enabling LINBIT repositories and confirming your selection, be sure to respond yes to the questions about installing LINBIT’s public key to your keyring and writing the repository configuration file.
Before it closes, the script will show a message that suggests different packages that you can install for different use cases.
Verifying LINBIT Repositories
After the LINBIT manage node helper script completes, you can verify that you enabled LINBIT repositories by using the dnf info or apt info command, after updating your package manager’s package metadata.
On RPM-based systems, enter:
# dnf --refresh info drbd-utils
On DEB-based systems, enter:
# apt update && apt info drbd-utils
Output from the package manager info command should show that the package manager is pulling package information from LINBIT repositories.
Excluding Packages from Red Hat or AlmaLinux Repositories
If you are using an RPM-based Linux distribution, before installing DRBD, be sure to only pull DRBD and related packages from LINBIT repositories. To do this, you will need to exclude certain packages from your RPM-based distribution’s repositories that overlap with packages in the LINBIT customer repositories.
The commands that follow insert an “exclude” line after the occurrence of every enabled repository line in all files in the repositories configuration directory, except for LINBIT repository files.
To exclude the relevant DRBD packages from enabled repositories on RPM-based distributions, enter the commands:
# RPM_REPOS="`ls /etc/yum.repos.d/*.repo|grep -v linbit`"
# PKGS="drbd kmod-drbd"
# for file in $RPM_REPOS; do sed -i "/^enabled[ =]*1/a exclude=$PKGS" $file; done
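If you would like to see what the exclude edit does before touching your real repository files, you can try it on throwaway copies first. The sketch below applies the same sed expression in a scratch directory; the file names and the baseos section are made-up placeholders, and the append syntax assumes GNU sed.

```shell
#!/bin/sh
# Dry run of the exclude edit against scratch files (requires GNU sed).
scratch=$(mktemp -d)

# Two made-up repository files: one distribution repo, one LINBIT repo.
printf '[baseos]\nname=Example BaseOS\nenabled=1\n' > "$scratch/example.repo"
printf '[linbit]\nname=LINBIT\nenabled=1\n' > "$scratch/linbit.repo"

PKGS="drbd kmod-drbd"
# Same selection as above: every .repo file except the LINBIT ones.
for file in $(ls "$scratch"/*.repo | grep -v linbit); do
    sed -i "/^enabled[ =]*1/a exclude=$PKGS" "$file"
done

# The distribution repo gains an exclude line; the LINBIT repo is unchanged.
cat "$scratch/example.repo"
```

After the loop, example.repo contains an exclude=drbd kmod-drbd line directly below enabled=1, while linbit.repo is left untouched.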
Using the Helper Script’s Suggested Package Manager Command to Install DRBD
To install DRBD, you can use the package manager command that the LINBIT helper script showed before the script completed. The relevant command was shown after this line:
If you don't intend to run an SDS satellite or controller, a useful set is: [...]
If you need to refer to the helper script’s suggested actions some time after the script completes, you can run the script again using the --hints flag:
# ./linbit-manage-node.py --hints
On DEB-based systems you can install a precompiled DRBD kernel module package, drbd-module-$(uname -r), or a source version of the kernel module, drbd-dkms. Install one or the other package but not both.
3.1.2. LINBIT Kernel Module Signing for Secure Boot
LINBIT signs most of its kernel module object files. The following table gives an overview of when module signing started for each distribution:
Distribution | Module signing since DRBD release |
---|---|
RHEL7 | 8.4.12/9.0.25/9.1.0 |
RHEL8 | 9.0.25/9.1.0 |
RHEL9+ | all available |
SLES15 | 9.0.31/9.1.4 |
Debian | 9.0.30/9.1.3 |
Ubuntu | 9.0.30/9.1.3 |
Oracle Linux | 9.1.17/9.2.6 |
The public signing key is shipped in the RPM package and gets installed to /etc/pki/linbit/SECURE-BOOT-KEY-linbit.com.der. It can be enrolled with the following command:
# mokutil --import /etc/pki/linbit/SECURE-BOOT-KEY-linbit.com.der
input password:
input password again:
A password can be chosen freely. It will be used when the key is actually enrolled to the MOK list after the required reboot.
3.2. LINBIT Supplied Docker Images
LINBIT provides a Docker registry for its commercial support customers. The registry is accessible through the host name ‘drbd.io’.
LINBIT’s container image repository (http://drbd.io) is only available to LINBIT customers or through LINBIT customer trial accounts. Contact LINBIT for information on pricing or to begin a trial. Alternatively, you may use LINSTOR SDS’ upstream project named Piraeus, without being a LINBIT customer.
Before you can pull images, you have to log in to the registry:
# docker login drbd.io
After a successful login, you can pull images. To test your login and the registry, start by issuing the following command:
# docker pull drbd.io/drbd-utils
# docker run -it --rm drbd.io/drbd-utils # press CTRL-D to exit
3.3. Distribution Supplied Packages
Several Linux distributions provide DRBD, including prebuilt binary packages. Support for these builds, if any, is being provided by the associated distribution vendor. Their release cycle may lag behind DRBD source releases.
3.3.1. SUSE Linux Enterprise Server
SLES High Availability Extension (HAE) includes DRBD.
On SLES, DRBD is normally installed through the software installation component of YaST2. It comes bundled with the High Availability Extension package selection.
Users who prefer a command line install may simply issue:
# yast -i drbd
or
# zypper install drbd
3.3.2. CentOS
CentOS has had DRBD 8 since release 5; for DRBD 9 you will need to examine EPEL and similar sources.
DRBD can be installed using yum (note that you will need a correct repository enabled for this to work):
# yum install drbd kmod-drbd
3.3.3. Ubuntu Linux
For Ubuntu LTS, LINBIT offers a PPA repository at https://launchpad.net/~linbit/+archive/ubuntu/linbit-drbd9-stack. See Adding Launchpad PPA Repositories for more information.
# apt install drbd-utils drbd-dkms
3.4. Compiling Packages from Source
Releases generated by Git tags on GitHub are snapshots of the Git repository at the given time. You most likely do not want to use these. They might lack things such as generated man pages, the configure script, and other generated files. If you want to build from a tar file, use the ones provided by us.
All our projects contain standard build scripts (e.g., Makefile, configure). Maintaining specific information per distribution (e.g., documenting broken build macros) is too cumbersome, and historically the information provided in this section got outdated quickly. If you don’t know how to build software the standard way, please consider using packages provided by LINBIT.
4. Building and Installing DRBD from Source
4.1. Downloading the DRBD Sources
The source tar files for both current and historic DRBD releases are available for download from https://pkg.linbit.com/. Source tar files, by convention, are named drbd-x.y.z.tar.gz, for example, drbd-utils-x.y.z.tar.gz, where x, y and z refer to the major, minor and bug fix release numbers.
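Because the naming convention is regular, the release number can be recovered from a file name with plain shell parameter expansion. The helper below is a hypothetical convenience function, and the file names passed to it are only illustrative examples of the x.y.z pattern.

```shell
#!/bin/sh
# Extract the x.y.z release number from a source tar file name.
tarball_version() {
    name=${1##*/}          # strip any leading directory part
    name=${name%.tar.gz}   # strip the .tar.gz extension
    echo "${name##*-}"     # keep what follows the last hyphen
}

tarball_version drbd-9.2.3.tar.gz                   # prints 9.2.3
tarball_version /usr/src/drbd-utils-9.2.3.tar.gz    # prints 9.2.3
```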
DRBD’s compressed source archive is less than half a megabyte in size. After downloading a tar file, you can decompress its contents into your current working directory, by using the tar -xzf command.
For organizational purposes, decompress DRBD into a directory normally used for keeping source code, such as /usr/src or /usr/local/src. The examples in this guide assume /usr/src.
4.2. Checking out Sources from the Public DRBD Source Repository
DRBD’s source code is kept in a public Git repository. You can browse this online at https://github.com/LINBIT. The DRBD software consists of these projects:
-
The DRBD kernel module
-
The DRBD utilities
Source code can be obtained by either cloning Git repositories or downloading release tar files. There are two minor differences between an unpacked source tar file and a Git checkout of the same release:
-
The Git checkout contains a debian/ subdirectory, while the source tar file does not. This is due to a request from Debian maintainers, who prefer to add their own Debian build configuration to a pristine upstream tar file.
-
The source tar file contains preprocessed man pages; the Git checkout does not. Therefore, building DRBD from a Git checkout requires a complete Docbook toolchain for building the man pages, while this is not a requirement for building from a source tar file.
4.2.1. DRBD Kernel Module
To check out a specific DRBD release from the repository, you must first clone the DRBD repository:
$ git clone --recursive https://github.com/LINBIT/drbd.git
This command will create a Git checkout subdirectory, named drbd. To now move to a source code state equivalent to a specific DRBD release (here 9.2.3), issue the following commands:
$ cd drbd
$ git checkout drbd-9.2.3
$ git submodule update
4.3. Building DRBD from Source
After cloning the DRBD and related utilities source code repositories to your local host, you can proceed to building DRBD from the source code.
4.3.1. Checking Build Prerequisites
Before being able to build DRBD from source, your build host must fulfill the following prerequisites:
-
make, gcc, the glibc development libraries, and the flex scanner generator must be installed.
You should verify that the gcc you use to compile the module is the same that was used to build the kernel you are running. If you have multiple gcc versions available on your system, DRBD’s build system includes a facility to select a specific gcc version.
-
For building directly from a Git checkout, GNU Autoconf is also required. This requirement does not apply when building from a tar file.
-
If you are running a stock kernel supplied by your distribution, you should install a matching kernel headers package. These are typically named kernel-devel, kernel-headers, linux-headers or similar. In this case, you can skip Preparing the Kernel Source Tree and continue with Preparing the DRBD Userspace Utilities Build Tree.
-
If you are not running a distribution stock kernel (that is, your system runs on a kernel built from source with a custom configuration), your kernel source files must be installed.
On RPM-based systems, these packages will be named similar to kernel-source-version.rpm, which is easily confused with kernel-version.src.rpm. The former is the correct package to install for building DRBD.
“Vanilla” kernel tar files from the http://kernel.org/ archive are simply named linux-version.tar.bz2 and should be unpacked in /usr/src/linux-version, with the symlink /usr/src/linux pointing to that directory.
In this case of building DRBD against kernel sources (not headers), you must continue with Preparing the Kernel Source Tree.
4.3.2. Preparing the Kernel Source Tree
To prepare your source tree for building DRBD, you must first enter the directory where your unpacked kernel sources are located. Typically this is /usr/src/linux-version, or simply a symbolic link named /usr/src/linux:
# cd /usr/src/linux
The next step is recommended, though not strictly necessary. Be sure to copy your existing .config file to a safe location before performing it. This step essentially reverts your kernel source tree to its original state, removing any leftovers from an earlier build or configure run:
# make mrproper
Now it is time to clone your currently running kernel configuration into the kernel source tree. There are a few possible options for doing this:
-
Many reasonably recent kernel builds export the currently-running configuration, in compressed form, via the /proc filesystem, enabling you to copy from there:
# zcat /proc/config.gz > .config
-
SUSE kernel Makefiles include a cloneconfig target, so on those systems, you can issue:
# make cloneconfig
-
Some installs put a copy of the kernel config into /boot, which allows you to do this:
# cp /boot/config-$(uname -r) .config
-
Finally, you can simply use a backup copy of a .config file which has been used for building the currently-running kernel.
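Which of these sources exists depends on the distribution and kernel build, so it can be convenient to probe for them in order. The following sketch uses a hypothetical helper function that prints the first readable candidate; it only locates the file and leaves the actual copy or decompression step to you.

```shell
#!/bin/sh
# Print the first readable kernel configuration among the candidate paths.
find_kernel_config() {
    for candidate in "$@"; do
        if [ -r "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Probe the two common locations mentioned above.
find_kernel_config /proc/config.gz "/boot/config-$(uname -r)" ||
    echo "no kernel configuration found; use a backup .config" >&2
```

If the function reports /proc/config.gz, remember that the file is compressed and needs zcat rather than cp.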
4.3.3. Preparing the DRBD Userspace Utilities Build Tree
The DRBD userspace compilation requires that you first configure your source tree with the included configure script.
When building from a Git checkout, the configure script does not yet exist. You must create it by simply typing autoconf from the top directory of the checkout.
Invoking the configure script with the --help option returns a full list of supported options. The table below summarizes the most important ones:
Option | Description | Default | Remarks |
---|---|---|---|
--prefix | Installation directory prefix | | This is the default to maintain Filesystem Hierarchy Standard compatibility for locally installed, unpackaged software. In packaging, this is typically overridden with |
--localstatedir | Local state directory | | Even with a default |
--sysconfdir | System configuration directory | | Even with a default |
--with-udev | Copy a rules file into your | yes | Disable for non- |
--with-heartbeat | Build DRBD Heartbeat integration | yes | You can disable this option unless you are planning to use DRBD’s Heartbeat v1 resource agent or |
--with-pacemaker | Build DRBD Pacemaker integration | yes | You can disable this option if you are not planning to use the Pacemaker cluster resource manager. |
--with-rgmanager | Build DRBD Red Hat Cluster Suite integration | no | You should enable this option if you are planning to use DRBD with |
--with-xen | Build DRBD Xen integration | yes (on x86 architectures) | You can disable this option if you don’t need the |
--with-bashcompletion | Installs a bash completion script for | yes | You can disable this option if you are using a shell other than bash, or if you do not want to use programmable completion for the |
--with-initscripttype | Type of your init system | auto | Type of init script to install (sysv, systemd, or both). |
--enable-spec | Create a distribution specific RPM spec file | no | For package builders only: you can use this option if you want to create an RPM spec file adapted to your distribution. See also Building the DRBD userspace RPM packages. |
Most users will want the following configuration options:
$ ./configure --prefix=/usr --localstatedir=/var --sysconfdir=/etc
The configure script will adapt your DRBD build to distribution specific needs. It does so by auto-detecting which distribution it is being invoked on, and setting defaults accordingly. When overriding defaults, do so with caution.
The configure script creates a log file, config.log
, in the
directory where it was invoked. When reporting build issues on the
mailing list, it is usually wise to either attach a copy of that file
to your email, or point others to a location from where it can be
viewed or downloaded.
4.3.4. Building DRBD Userspace Utilities
To build DRBD’s userspace utilities, invoke the following commands from the top of your Git checkout or expanded tar file:
$ make
$ sudo make install
This will build the management utilities (drbdadm
, drbdsetup
, and
drbdmeta
), and install them in the appropriate locations. Based on
the other --with
options selected during the
configure stage, it will also install
scripts to integrate DRBD with other applications.
4.3.5. Compiling the DRBD Kernel Module
The kernel module does not use GNU
autotools
, therefore building and
installing the kernel module is usually a simple two step process.
Building the DRBD Kernel Module for the Currently Running Kernel
After changing into your unpacked DRBD kernel module sources directory, you can now build the module:
$ cd drbd-9.0
$ make clean all
This will build the DRBD kernel module to match your currently-running
kernel, whose kernel source is expected to be accessible via the
/lib/modules/`uname -r`/build symlink.
Building Against Prepared Kernel Headers
If the /lib/modules/`uname -r`/build symlink does not exist, and you
are building against a running stock kernel (one that was shipped
pre-compiled with your distribution), you can also set the KDIR
variable to point to the matching kernel headers (as opposed to
kernel sources) directory. Note that besides the actual kernel headers — commonly found in /usr/src/linux-version/include
— the
DRBD build process also looks for the kernel Makefile
and
configuration file (.config
), which pre-built kernel headers
packages commonly include.
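Because the build looks for both files, it can save a failed build to check for them first. The following is a minimal sketch under that assumption; the function name kdir_looks_usable is hypothetical and is not provided by the DRBD build system.

```shell
# Hypothetical pre-flight check: succeed only if the directory given
# as $1 contains both a kernel Makefile and a .config file, the two
# files the text above says the DRBD build looks for.
kdir_looks_usable() {
    kdir="$1"
    [ -f "$kdir/Makefile" ] && [ -f "$kdir/.config" ]
}
```

A call such as kdir_looks_usable /usr/src/linux-headers-3.2.0-4-amd64 before running make with the KDIR variable gives an early hint when the headers package is incomplete. It does not verify that the headers match the target kernel.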
To build against prepared kernel headers, issue, for example:
$ cd drbd-9.0
$ make clean
$ make KDIR=/usr/src/linux-headers-3.2.0-4-amd64/
Building Against a Kernel Source Tree
If you are building DRBD against a kernel other than your currently running one, and you do not have prepared kernel sources for your target kernel available, you need to build DRBD against a complete target kernel source tree. To do so, set the KDIR variable to point to the kernel sources directory:
$ cd drbd-9.0
$ make clean
$ make KDIR=/root/linux-3.6.6/
Using a Non-default C Compiler
You also have the option of setting the compiler explicitly via the CC variable. This is known to be necessary on some Fedora versions, for example:
$ cd drbd-9.0
$ make clean
$ make CC=gcc32
Checking for successful build completion
If the module build completes successfully, you should see a kernel
module file named drbd.ko
in the drbd
directory. You can
interrogate the newly-built module with /sbin/modinfo drbd.ko
if you
are so inclined.
Kernel Application Binary Interface warning for some distributions
Please note that some distributions (like RHEL 6 and derivatives) claim to have a stable kernel application binary interface (kABI), that is, the kernel API should stay consistent during minor releases (that is, for kernels published in the RHEL 6.3 series).
In practice this is not working all of the time; there are some known cases (even within a minor release) where things got changed incompatibly. In these cases external modules (like DRBD) can fail to load, cause a kernel panic, or break in even more subtle ways[3], and need to be rebuilt against the matching kernel headers.
4.4. Installing DRBD
Provided your DRBD build completed successfully, you will be able to install DRBD by issuing the following commands:
$ cd drbd-9.0 && sudo make install && cd ..
$ cd drbd-utils && sudo make install && cd ..
The DRBD userspace management tools (drbdadm
, drbdsetup
, and
drbdmeta
) will now be installed in the prefix
path that was passed to
configure
, typically /sbin/
.
Note that any kernel upgrade will require you to rebuild and reinstall the DRBD kernel module to match the new kernel.
Some distributions allow you to register kernel module source directories, so that
rebuilds are done as necessary. See, for example, dkms(8) on Debian.
The DRBD userspace tools, in contrast, need only to be rebuilt and reinstalled when upgrading to a new DRBD version. If at any time you upgrade to a new kernel and new DRBD version, you will need to upgrade both components.
4.5. Building the DRBD userspace RPM packages
The DRBD build system contains a facility to build RPM packages
directly out of the DRBD source tree. For building RPMs,
Checking Build Prerequisites applies essentially in the same way as for building
and installing with make
, except that you also need the RPM build
tools, of course.
Also, see Preparing the Kernel Source Tree if you are not building against a running kernel with precompiled headers available.
The build system offers two approaches for building RPMs. The simpler
approach is to simply invoke the rpm
target in the top-level
Makefile:
$ ./configure
$ make rpm
This approach will auto-generate spec files from pre-defined templates, and then use those spec files to build binary RPM packages.
The make rpm
approach generates several RPM packages:
Package name | Description | Dependencies | Remarks |
---|---|---|---|
drbd | DRBD meta-package | All other | Top-level virtual package. When installed, this pulls in all other userland packages as dependencies. |
drbd-utils | Binary administration utilities | Required for any DRBD enabled host | |
drbd-udev | udev integration facility | | Enables |
drbd-xen | Xen DRBD helper scripts | | Enables |
drbd-heartbeat | DRBD Heartbeat integration scripts | | Enables DRBD management by legacy v1-style Heartbeat clusters |
drbd-pacemaker | DRBD Pacemaker integration scripts | | Enables DRBD management by Pacemaker clusters |
drbd-rgmanager | DRBD RedHat Cluster Suite integration scripts | | Enables DRBD management by |
drbd-bashcompletion | Programmable bash completion | | Enables Programmable bash completion for the |
The other, more flexible approach is to have configure
generate the
spec file, make any changes you deem necessary, and then use the
rpmbuild
command:
$ ./configure --enable-spec
$ make tgz
$ cp drbd*.tar.gz `rpm -E %_sourcedir`
$ rpmbuild -bb drbd.spec
The RPMs will be created wherever your system RPM configuration (or
your personal ~/.rpmmacros
configuration) dictates.
After you have created these packages, you can install, upgrade, and uninstall them as you would any other RPM package in your system.
Note that any kernel upgrade will require you to generate a new
kmod-drbd
package to match the new kernel; see also Kernel Application Binary Interface warning for some distributions.
The DRBD userland packages, in contrast, need only be recreated when upgrading to a new DRBD version. If at any time you upgrade to a new kernel and new DRBD version, you will need to upgrade both packages.
4.6. Building a DRBD Debian package
The DRBD build system contains a facility to build Debian packages
directly out of the DRBD source tree. For building Debian packages,
Checking Build Prerequisites applies essentially in the same way as for building
and installing with make
, except that you of course also need the
dpkg-dev
package containing the Debian packaging tools, and
fakeroot
if you want to build DRBD as a non-root user (highly
recommended). All DRBD sub-projects (kernel module and drbd-utils
) support Debian package building.
Also, see Preparing the Kernel Source Tree if you are not building against a running kernel with precompiled headers available.
The DRBD source tree includes a debian
subdirectory containing the
required files for Debian packaging. That subdirectory, however, is
not included in the DRBD source tar files — instead, you will
need to create a Git checkout of a tag
associated with a specific DRBD release.
Once you have created your checkout in this fashion, you can issue the following commands to build DRBD Debian packages:
$ dpkg-buildpackage -rfakeroot -b -uc
This (example) dpkg-buildpackage invocation enables a
binary-only build (-b) by a non-root user (-rfakeroot),
disabling cryptographic signature for the changes file (-uc). Of
course, you might prefer other build options; see the
dpkg-buildpackage man page for details.
This build process will create the following Debian packages:
-
A package containing the DRBD userspace tools, named drbd-utils_x.y.z-ARCH.deb;
-
A module source package suitable for module-assistant, named drbd-module-source_x.y.z-BUILD_all.deb;
-
A dkms package suitable for dkms, named drbd-dkms_x.y.z-BUILD_all.deb.
After you have created these packages, you can install, upgrade, and uninstall them as you would any other Debian package in your system.
The drbd-utils package supports Debian’s dpkg-reconfigure facility, which
can be used to switch which version of the man pages is shown by default
(8.3, 8.4, or 9.0).
Building and installing the actual kernel module from the installed
module source package is easily accomplished via Debian’s
module-assistant
facility:
# module-assistant auto-install drbd-module
You can also use the shorthand form of the above command:
# m-a a-i drbd-module
Note that any kernel upgrade will require you to rebuild the kernel
module (with module-assistant
, as just described) to match the new
kernel. The drbd-utils
and drbd-module-source
packages, in
contrast, only need to be recreated when upgrading to a new DRBD
version. If at any time you upgrade to a new kernel and new DRBD
version, you will need to upgrade both packages.
Starting from DRBD9, automatic updates of the DRBD kernel module are possible
with the help of dkms(8)
. All that is needed is to install the drbd-dkms
Debian package.
Working with DRBD
5. Common Administrative Tasks
This chapter outlines typical administrative tasks encountered during day-to-day operations. It does not cover troubleshooting tasks; these are covered in detail in Troubleshooting and Error Recovery.
5.1. Configuring DRBD
5.1.1. Preparing your lower-level storage
After you have installed DRBD, you must set aside a roughly identically sized storage area on both cluster nodes. This will become the lower-level device for your DRBD resource. You may use any type of block device found on your system for this purpose. Typical examples include:
-
A hard drive partition (or a full physical hard drive),
-
a software RAID device,
-
an LVM Logical Volume or any other block device configured by the Linux device-mapper infrastructure,
-
any other block device type found on your system.
You may also use resource stacking, meaning you can use one DRBD device as a lower-level device for another. Some specific considerations apply to stacked resources; their configuration is covered in detail in Creating a Stacked Three-node Setup.
While it is possible to use loop devices as lower-level devices for DRBD, doing so is not recommended due to deadlock issues.
It is not necessary for this storage area to be empty before you create a DRBD resource from it. In fact it is a common use case to create a two-node cluster from a previously non-redundant single-server system using DRBD (some caveats apply — please refer to DRBD Metadata if you are planning to do this).
For the purposes of this guide, we assume a very simple setup:
-
Both hosts have a free (currently unused) partition named /dev/sda7.
-
We are using internal metadata.
5.1.2. Preparing your network configuration
It is recommended, though not strictly required, that you run your DRBD replication over a dedicated connection. At the time of this writing, the most reasonable choice for this is a direct, back-to-back, Gigabit Ethernet connection. When DRBD is run over switches, use of redundant components and the bonding driver (in active-backup mode) is recommended.
It is generally not recommended to run DRBD replication via routers, for reasons of fairly obvious performance drawbacks (adversely affecting both throughput and latency).
In terms of local firewall considerations, it is important to understand that DRBD (by convention) uses TCP ports from 7788 upwards, with every resource listening on a separate port. DRBD uses two TCP connections for every resource configured. For proper DRBD functionality, it is required that these connections are allowed by your firewall configuration.
Security considerations other than firewalling may also apply if a Mandatory Access Control (MAC) scheme such as SELinux or AppArmor is enabled. You may have to adjust your local security policy so it does not keep DRBD from functioning properly.
You must, of course, also ensure that the TCP ports for DRBD are not already used by another application.
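One way to check for a port conflict is to scan the list of listening sockets for the port you intend to use. The sketch below assumes a listing in the column layout printed by ss -tln (local address and port in the fourth column); the function name port_listed is hypothetical, and the listing is read from standard input so that the check itself has no system dependencies.

```shell
# Hypothetical check: read a listening-socket listing on stdin and
# succeed if any local address in column 4 ends in the given port.
# $1: TCP port number to look for, for example 7789
port_listed() {
    awk -v port="$1" '
        NF >= 4 {
            n = split($4, parts, ":")      # port is the last ":"-separated piece
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

On a host with the iproute2 tools installed, something like ss -tln | port_listed 7789 would report whether another service already listens on that port.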
Since DRBD version 9.2.6, it is possible to configure a DRBD resource to support more than one TCP connection pair, for traffic load balancing purposes. Refer to the Load Balancing DRBD Traffic section for details.
For the purposes of this guide, we assume a very simple setup:
-
Our two DRBD hosts each have a currently unused network interface, eth1, with IP addresses 10.1.1.31 and 10.1.1.32 assigned to it, respectively.
-
No other services are using TCP ports 7788 through 7799 on either host.
-
The local firewall configuration allows both inbound and outbound TCP connections between the hosts over these ports.
5.1.3. Configuring your resource
All aspects of DRBD are controlled in its configuration file,
/etc/drbd.conf
. Normally, this configuration file is just a skeleton
with the following contents:
include "/etc/drbd.d/global_common.conf";
include "/etc/drbd.d/*.res";
By convention, /etc/drbd.d/global_common.conf
contains the
global
and common
sections of the DRBD configuration, whereas the .res
files contain
one resource
section each.
It is also possible to use drbd.conf
as a flat configuration file
without any include
statements at all. Such a configuration,
however, quickly becomes cluttered and hard to manage, which is why
the multiple-file approach is the preferred one.
Regardless of which approach you employ, you should always make sure
that drbd.conf
, and any other files it includes, are exactly
identical on all participating cluster nodes.
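A simple way to spot configuration drift is to compute one checksum over all configuration files on each node and compare the results. The following is a minimal sketch, not a DRBD facility: the function name config_digest is hypothetical, and it assumes that only files ending in .conf or .res matter.

```shell
# Hypothetical helper: print a single checksum over every *.conf and
# *.res file found under the given files or directories.
# Files are concatenated in sorted path order so that the result is
# stable from node to node.
config_digest() {
    find "$@" -type f \( -name '*.conf' -o -name '*.res' \) |
        LC_ALL=C sort |
        while IFS= read -r file; do
            cat "$file"
        done |
        cksum
}
```

Running config_digest /etc/drbd.conf /etc/drbd.d on every node and comparing the printed values shows whether the files are byte-for-byte identical. Differences in whitespace or comments also change the checksum.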
The DRBD source tarball contains an example configuration file in the
scripts
subdirectory. Binary installation packages will either
install this example configuration directly in /etc
, or in a
package-specific documentation directory such as
/usr/share/doc/packages/drbd
.
This section describes only those few aspects of the configuration
file which are absolutely necessary to understand in order to get DRBD
up and running. The configuration file’s syntax and contents are
documented in great detail in the man page of drbd.conf
.
Example configuration
For the purposes of this guide, we assume a minimal setup in line with the examples given in the previous sections:
/etc/drbd.d/global_common.conf
global {
  usage-count yes;
}

common {
  net {
    protocol C;
  }
}
/etc/drbd.d/r0.res
resource "r0" {
  device minor 1;
  disk "/dev/sda7";
  meta-disk internal;

  on "alice" {
    node-id 0;
  }
  on "bob" {
    node-id 1;
  }
  connection {
    host "alice" address 10.1.1.31:7789;
    host "bob" address 10.1.1.32:7789;
  }
}
This example configures DRBD in the following fashion:
-
You “opt in” to be included in DRBD’s usage statistics (see usage-count).
-
Resources are configured to use fully synchronous replication (Protocol C) unless explicitly specified otherwise.
-
Our cluster consists of two nodes, ‘alice’ and ‘bob’.
-
We have a resource arbitrarily named r0 which uses /dev/sda7 as the lower-level device, and is configured with internal metadata.
-
The resource uses TCP port 7789 for its network connections, and binds to the IP addresses 10.1.1.31 and 10.1.1.32, respectively.
The configuration above implicitly creates one volume in the
resource, numbered zero (0
). For multiple volumes in one resource,
modify the syntax as follows (assuming that the same lower-level storage block
devices are used on both nodes):
/etc/drbd.d/r0.res
resource "r0" {
  volume 0 {
    device minor 1;
    disk "/dev/sda7";
    meta-disk internal;
  }
  volume 1 {
    device minor 2;
    disk "/dev/sda8";
    meta-disk internal;
  }

  on "alice" {
    node-id 0;
  }
  on "bob" {
    node-id 1;
    volume 1 {
      disk "/dev/sda9";
    }
  }
  connection {
    host "alice" address 10.1.1.31:7789;
    host "bob" address 10.1.1.32:7789;
  }
}
-
Host sections (‘on’ keyword) inherit volume sections from the resource level. They may also contain volume sections themselves; these values take precedence over inherited values.
Volumes may also be added to existing resources on the fly. For an example see Adding a New DRBD Volume to an Existing Volume Group.
For compatibility with older releases, DRBD also supports drbd-8.4-style configuration files.
resource r0 {
  on alice {
    device /dev/drbd1;
    disk /dev/sda7;
    meta-disk internal;
    address 10.1.1.31:7789;
  }
  on bob {
    device /dev/drbd1;
    disk /dev/sda7;
    meta-disk internal;
    address 10.1.1.32:7789;
  }
}
-
Strings that do not contain keywords might be given without double quotes (").
-
In the old (8.4) version, the way to specify the device was by using a string that specified the name of the resulting /dev/drbdX device file.
-
Two node configurations get node numbers assigned by drbdadm.
-
A pure two node configuration gets an implicit connection.
The global
section
This section is allowed only once in the configuration. It is normally
in the /etc/drbd.d/global_common.conf
file. In a single-file
configuration, it should go to the very top of the configuration
file. Of the few options available in this section, only one is of
relevance to most users:
usage-count
The DRBD project keeps statistics about the usage of various DRBD
versions. This is done by contacting an HTTP server every time a new
DRBD version is installed on a system. This can be disabled by setting
usage-count no;
. The default is usage-count ask;
which will
prompt you every time you upgrade DRBD.
DRBD’s usage statistics are, of course, publicly available: see http://usage.drbd.org.
The common
section
This section provides a shorthand method to define configuration
settings inherited by every resource. It is normally found in
/etc/drbd.d/global_common.conf
. You may define any option you can
also define on a per-resource basis.
Including a common
section is not strictly required, but strongly
recommended if you are using more than one resource. Otherwise, the
configuration quickly becomes convoluted by repeatedly-used options.
In the example above, we included net { protocol C; }
in the
common
section, so every resource configured (including r0
)
inherits this option unless it has another protocol
option
configured explicitly. For other synchronization protocols available,
see Replication Modes.
The resource
sections
A per-resource configuration file is usually named
/etc/drbd.d/resource.res
. Any DRBD resource you define must be
named by specifying a resource name in the configuration. The convention
is to use only letters, digits, and the underscore; while it is technically
possible to use other characters as well, you won’t like the result if you ever
need the more specific resource:peer/volume
syntax.
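The naming convention can be checked mechanically before a resource file is written. This is a minimal sketch of such a check; the function name follows_name_convention is hypothetical, and the check covers only the convention stated above (letters, digits, and the underscore), not what DRBD technically accepts.

```shell
# Hypothetical check: succeed if the name in $1 is non-empty and
# consists only of letters, digits, and underscores.
follows_name_convention() {
    case "$1" in
        ''|*[!A-Za-z0-9_]*) return 1 ;;   # empty, or contains another character
        *) return 0 ;;
    esac
}
```

With this check, r0 passes, while a name containing a hyphen or a colon does not.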
Every resource configuration must also have at least two on host
sub-sections,
one for every cluster node. All other configuration settings are
either inherited from the common
section (if it exists), or derived
from DRBD’s default settings.
In addition, options with equal values on all hosts
can be specified directly in the resource
section. Thus, we can
further condense our example configuration as follows:
resource "r0" {
  device minor 1;
  disk "/dev/sda7";
  meta-disk internal;
  on "alice" {
    address 10.1.1.31:7789;
  }
  on "bob" {
    address 10.1.1.32:7789;
  }
}
5.1.4. Defining network connections
Currently the communication links in DRBD 9 must build a full mesh, i.e. in every resource every node must have a direct connection to every other node (excluding itself, of course).
For the simple case of two hosts drbdadm
will insert the (single) network
connection by itself, for ease of use and backwards compatibility.
The net effect of this is a quadratic number of network connections over hosts. For the “traditional” two nodes one connection is needed; for three hosts there are three node pairs; for four, six pairs; 5 hosts: 10 connections, and so on. For (the current) maximum of 16 nodes there will be 120 host pairs to connect.
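The counts quoted above follow from the number of unordered node pairs, n × (n − 1) / 2. A minimal sketch that reproduces them (the function name mesh_connections is made up for this example):

```shell
# Hypothetical helper: number of connections in a full mesh of $1 nodes,
# computed as the number of unordered node pairs n * (n - 1) / 2.
mesh_connections() {
    nodes="$1"
    echo $(( nodes * (nodes - 1) / 2 ))
}
```

For example, mesh_connections 4 prints 6 and mesh_connections 16 prints 120, matching the figures in the text.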
An example configuration file for three hosts would be this:
resource r0 {
  device minor 1;
  disk "/dev/sda7";
  meta-disk internal;
  on alice {
    address 10.1.1.31:7000;
    node-id 0;
  }
  on bob {
    address 10.1.1.32:7000;
    node-id 1;
  }
  on charlie {
    address 10.1.1.33:7000;
    node-id 2;
  }
  connection-mesh {
    hosts alice bob charlie;
  }
}
If you have enough network cards in your servers, you can create direct cross-over links between server pairs. A single four-port Ethernet card allows you to have a single management interface, and to connect three other servers, to get a full mesh for four cluster nodes.
In this case you can specify a different IP address to use the direct link:
resource r0 {
  ...
  connection {
    host alice address 10.1.2.1:7010;
    host bob address 10.1.2.2:7001;
  }
  connection {
    host alice address 10.1.3.1:7020;
    host charlie address 10.1.3.2:7002;
  }
  connection {
    host bob address 10.1.4.1:7021;
    host charlie address 10.1.4.2:7012;
  }
}
For easier maintenance and debugging, it’s recommended that you have different ports for each
endpoint. This will allow you to more easily associate packets to an endpoint when doing a
tcpdump
. The examples below will still be using two servers only; please see
Example configuration for four nodes for a four-node example.
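When writing explicit connection sections by hand, it helps to list the host pairs first so that none is forgotten. The following minimal sketch prints every unordered pair of the host names given as arguments; the function name mesh_pairs is hypothetical and nothing here is generated by drbdadm.

```shell
# Hypothetical helper: print one line per unordered pair of the host
# names passed as arguments. Each printed pair corresponds to one
# connection section in a full-mesh resource configuration.
mesh_pairs() {
    while [ "$#" -gt 1 ]; do
        first="$1"
        shift
        for peer in "$@"; do
            printf '%s %s\n' "$first" "$peer"
        done
    done
}
```

Called as mesh_pairs alice bob charlie, it lists the same three pairs that the explicit connection sections above cover.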
5.1.5. Configuring multiple paths
DRBD allows configuring multiple paths per connection, by introducing multiple path sections in a connection. Please see the following example:
resource <resource> {
  ...
  connection {
    path {
      host alpha address 192.168.41.1:7900;
      host bravo address 192.168.41.2:7900;
    }
    path {
      host alpha address 192.168.42.1:7900;
      host bravo address 192.168.42.2:7900;
    }
  }
  ...
}
Obviously the two endpoint hostnames need to be equal in all paths of a connection. Paths may be on different IPs (potentially different NICs) or may only be on different ports.
The TCP transport uses one path at a time, unless you have configured load balancing (refer to Load Balancing DRBD Traffic). If the backing TCP connections get dropped, or show timeouts, the TCP transport implementation tries to establish a connection over the next path. It goes over all paths in a round-robin fashion until a connection gets established.
The RDMA transport uses all paths of a connection concurrently and it balances the network traffic between the paths evenly.
5.1.6. Configuring transport implementations
DRBD supports multiple network transports. A transport implementation can be configured for each connection of a resource.
TCP/IP
TCP is the default transport for DRBD replication traffic. Each DRBD resource connection where
the transport
option is not specified in the resource configuration will use the TCP
transport.
resource <resource> {
  net {
    transport "tcp";
  }
  ...
}
You can configure the tcp
transport with the following options, by specifying them in the
net
section of a resource configuration: sndbuf-size
, rcvbuf-size
, connect-int
,
socket-check-timeout
, ping-timeout
, timeout
, load-balance-paths
, and tls
. Refer to
man drbd.conf-9.0
for more details about each option.
Load Balancing DRBD Traffic
It is not possible at this time to use the DRBD TCP load balancing and TLS traffic encryption features concurrently on the same resource.
By default, the TCP transport establishes a connection path between DRBD resource peers
serially, that is, one at a time. Since DRBD version 9.2.6, by setting the option
load-balance-paths
to yes
, you can enable the transport to establish all paths in parallel.
Also, when load balancing is configured, the transport will always send replicated traffic into
the path with the shortest send queue. Data can arrive out of order on the receiving side when
multiple paths are established. The DRBD transport implementation takes care of sorting the
received data packets and provides the data to the DRBD core in the original sending order.
Using the load balancing feature also requires a drbd-utils version 9.26.0 or
later. If you have an earlier version of drbd-utils installed, you might get “bad parser”
error messages when trying to run drbdadm commands against resources for which you have
configured load balancing.
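Version strings such as 9.26.0 cannot be compared as plain text, because 9.3.0 would sort after 9.26.0. The following minimal sketch compares dotted versions numerically, field by field; the function name version_ge is hypothetical, and how you obtain the installed drbd-utils version is left to your environment.

```shell
# Hypothetical helper: succeed if dotted version $1 is greater than or
# equal to dotted version $2, comparing each field as a number.
# Missing fields count as 0, so 9.26 is treated like 9.26.0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}
```

For instance, version_ge "$installed" 9.26.0 would succeed only when the value in the (assumed) variable installed meets the drbd-utils requirement stated above.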
An example configuration with load balancing configured for a DRBD resource named drbd-lb-0
,
is as follows:
drbd-lb-0.res
resource "drbd-lb-0" {
  [...]
  net {
    load-balance-paths yes;
    [...]
  }

  on "node-0" {
    volume 0 {
      [...]
    }
    node-id 0;
  }
  on "node-1" {
    volume 0 {
      [...]
    }
    node-id 1;
  }
  on "node-2" {
    volume 0 {
      [...]
    }
    node-id 2;
  }

  connection {
    path {
      host "node-0" address ipv4 192.168.220.60:7900;
      host "node-1" address ipv4 192.168.220.61:7900;
    }
    path {
      host "node-0" address ipv4 192.168.221.60:7900;
      host "node-1" address ipv4 192.168.221.61:7900;
    }
  }
  connection {
    path {
      host "node-0" address ipv4 192.168.220.60:7900;
      host "node-2" address ipv4 192.168.220.62:7900;
    }
    path {
      host "node-0" address ipv4 192.168.221.60:7900;
      host "node-2" address ipv4 192.168.221.62:7900;
    }
  }
  connection {
    path {
      host "node-1" address ipv4 192.168.220.61:7900;
      host "node-2" address ipv4 192.168.220.62:7900;
    }
    path {
      host "node-1" address ipv4 192.168.221.61:7900;
      host "node-2" address ipv4 192.168.221.62:7900;
    }
  }
}
While the above configuration shows three DRBD connections, only two are necessary on
any given node in a three-node cluster. For example, on node-0, the connection between
node-1 and node-2 is unnecessary. On node-1, the connection between node-0 and node-2
is unnecessary, and likewise on node-2 for the connection between node-0 and node-1.
Nevertheless, it can be helpful to have all possible connections in your resource
configuration. This way, you can use a single configuration file on all the nodes in
your cluster without having to edit and customize the configuration on each node.
Securing DRBD Connections with TLS
It is not possible at this time to use the DRBD TCP load balancing and TLS traffic encryption features concurrently on the same resource.
You can enable authenticated and encrypted DRBD connections via the tcp
transport by adding
the tls
net option to a DRBD resource configuration file.
resource <resource> {
  net {
    tls yes;
  }
  ...
}
DRBD will temporarily pass the sockets to a user space utility (tlshd
, part of the
ktls-utils
package) when establishing connections. tlshd
will use the keys configured in
/etc/tlshd.conf
to set up authentication and encryption.
[authenticate.client]
x509.certificate=/etc/tlshd.d/tls.crt
x509.private_key=/etc/tlshd.d/tls.key
x509.truststore=/etc/tlshd.d/ca.crt

[authenticate.server]
x509.certificate=/etc/tlshd.d/tls.crt
x509.private_key=/etc/tlshd.d/tls.key
x509.truststore=/etc/tlshd.d/ca.crt
RDMA
You can configure DRBD resource replication traffic to use RDMA rather than TCP as a transport type by specifying it explicitly in a DRBD resource configuration.
resource <resource> {
  net {
    transport "rdma";
  }
  ...
}
You can configure the rdma
transport with the following options, by specifying them in the
net
section of the resource configuration: sndbuf-size
, rcvbuf-size
, max_buffers
,
connect-int
, socket-check-timeout
, ping-timeout
, timeout
. Refer to man drbd.conf-9.0
for more details about each option.
The rdma
transport is a zero-copy-receive transport. One implication of that is that the
max_buffers
configuration option must be set to a value big enough to hold all rcvbuf-size
.
rcvbuf-size is configured in bytes, while max_buffers is configured in pages. For
optimal performance max_buffers should be big enough to hold all of rcvbuf-size and the
amount of data that might be in transit to the back-end device at any point in time.
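Because the two options use different units, the lower bound for max_buffers is the receive buffer size divided by the page size, rounded up. The following minimal sketch performs that conversion; the function name min_buffers_for is hypothetical, and the result covers only the rcvbuf-size part, not the additional data in transit to the back-end device.

```shell
# Hypothetical helper: smallest number of pages that can hold $1 bytes.
# $1: receive buffer size in bytes
# $2: optional page size in bytes, defaults to the system page size
min_buffers_for() {
    bytes="$1"
    page_size="${2:-$(getconf PAGESIZE)}"
    # Integer ceiling division: round up to whole pages.
    echo $(( (bytes + page_size - 1) / page_size ))
}
```

For example, with 4096-byte pages, a 10 MiB receive buffer (10485760 bytes) needs at least 2560 pages.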
In case you are using InfiniBand host channel adapters (HCAs) with the rdma
transport, you also need to configure IP over InfiniBand (IPoIB). The IP address is not used for
data transfer, but it is used to find the right adapters and ports while establishing the
connection.
The configuration options sndbuf-size and rcvbuf-size are only considered at the
time a connection is established. While you can change their values when the connection is
established, your changes will only take effect when the connection is re-established.
Performance considerations for RDMA
By looking at the pseudo file /sys/kernel/debug/drbd/<resource>/connections/<peer>/transport,
the counts of available receive descriptors (rx_desc) and transmit descriptors (tx_desc)
can be monitored. If one of the descriptor kinds becomes depleted you should increase
sndbuf-size
or rcvbuf-size
.
5.1.7. Enabling your resource for the first time
After you have completed initial resource configuration as outlined in the previous sections, you can bring up your resource.
Each of the following steps must be completed on both nodes.
Please note that with our example config snippets (resource r0 { … }
), <resource>
would be
r0
.
This step must be completed only on initial device creation. It initializes DRBD’s metadata:
# drbdadm create-md <resource>
v09 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New drbd meta data block successfully created.
The number of bitmap slots that are allocated in the metadata depends on the number of hosts for this resource; by default, the hosts in the resource configuration are counted. If all hosts are specified before creating the metadata, this will “just work”; adding bitmap slots for further nodes is possible later, but incurs some manual work.
This step associates the resource with its backing device (or devices, in case of a multi-volume resource), sets replication parameters, and connects the resource to its peer:
# drbdadm up <resource>
The drbdadm status command output should now contain information similar to the following:
# drbdadm status r0
r0 role:Secondary
  disk:Inconsistent
  bob role:Secondary
    disk:Inconsistent
The Inconsistent/Inconsistent disk state is expected at this point.
By now, DRBD has successfully allocated both disk and network resources and is ready for operation. What it does not know yet is which of your nodes should be used as the source of the initial device synchronization.
5.1.8. The initial device synchronization
There are two more steps required for DRBD to become fully operational:
If you are dealing with newly-initialized, empty disks, this choice is entirely arbitrary. If one of your nodes already has valuable data that you need to preserve, however, it is of crucial importance that you select that node as your synchronization source. If you do initial device synchronization in the wrong direction, you will lose that data. Exercise caution.
This step must be performed on only one node, only on initial resource configuration, and only on the node you selected as the synchronization source. To perform this step, issue this command:
# drbdadm primary --force <resource>
After issuing this command, the initial full synchronization will commence. You will be able to monitor its progress via drbdadm status. It may take some time depending on the size of the device.
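One way to wait for the initial synchronization from a script is to inspect the text printed by drbdadm status. The following is a minimal sketch, not part of DRBD itself: the function name all_disks_uptodate is hypothetical, and it assumes the disk: and peer-disk: fields appear as in the status examples in this guide.

```shell
#!/bin/sh
# Hypothetical helper: reads "drbdadm status" text on stdin.
# Succeeds only if at least one disk:/peer-disk: field is present
# and every such field reads UpToDate.
all_disks_uptodate() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(peer-)?disk:/) {
                    seen++
                    split($i, kv, ":")
                    if (kv[2] != "UpToDate") bad++
                }
            }
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }
    '
}
```

A polling loop could then look like: until drbdadm status r0 | all_disks_uptodate; do sleep 10; done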
By now, your DRBD device is fully operational, even before the initial synchronization has completed (albeit with slightly reduced performance). If you started with empty disks you may now already create a filesystem on the device, use it as a raw block device, mount it, and perform any other operation you would with an accessible block device.
You will now probably want to continue with Working with DRBD, which describes common administrative tasks to perform on your resource.
5.1.9. Skipping initial resynchronization
If (and only if) you are starting DRBD resources from scratch (with no valuable data on them), you can use the following command sequence to skip the initial resync. Do not do this with data you want to keep on the devices.
On all nodes:
# drbdadm create-md <res>
# drbdadm up <res>
The command drbdadm status should now show all disks as Inconsistent.
Then, on one node execute the following command:
# drbdadm new-current-uuid --clear-bitmap <resource>/<volume>
or
# drbdsetup new-current-uuid --clear-bitmap <minor>
Running drbdadm status now shows the disks as UpToDate (even though the backing devices might be out of sync). You can now create a file system on the disk and start using it.
Do not use this procedure with data you want to keep, or the data will be corrupted.
5.1.10. Using truck based replication
In order to preseed a remote node with data which is then to be kept synchronized, and to skip the initial full device synchronization, follow these steps.
This assumes that your local node has a configured, but disconnected DRBD resource in the Primary role. That is to say, device configuration is completed, identical drbd.conf copies exist on both nodes, and you have issued the commands for initial resource promotion on your local node, but the remote node is not connected yet.
-
On the local node, issue the following command:
# drbdadm new-current-uuid --clear-bitmap <resource>/<volume>
or
# drbdsetup new-current-uuid --clear-bitmap <minor>
-
Create a consistent, verbatim copy of the resource’s data and its metadata. You may do so, for example, by removing a hot-swappable drive from a RAID-1 mirror. You would, of course, replace it with a fresh drive, and rebuild the RAID set, to ensure continued redundancy. But the removed drive is a verbatim copy that can now be shipped off site. If your local block device supports snapshot copies (such as when using DRBD on top of LVM), you may also create a bitwise copy of that snapshot using dd.
-
On the local node, issue:
# drbdadm new-current-uuid <resource>
or the matching drbdsetup command.
Note the absence of the --clear-bitmap option in this second invocation.
-
Physically transport the copies to the remote peer location.
-
Add the copies to the remote node. This may again be a matter of plugging in a physical disk, or grafting a bitwise copy of your shipped data onto existing storage on the remote node. Be sure to restore or copy not only your replicated data, but also the associated DRBD metadata. If you fail to do so, the disk shipping process is moot.
-
On the new node you need to fix the node ID in the metadata, and exchange the peer-node info for the two nodes. Refer to the following command lines as an example for changing the node ID from 2 to 1 on resource r0, volume 0.
This must be done while the volume is not in use.
You need to edit the first four lines to match your needs. V is the resource name with the volume number. NODE_FROM is the node ID of the node the data originates from. NODE_TO is the node ID of the node where data will be replicated to. META_DATA_LOCATION is the location of the metadata which might be internal or flex-external.
V=r0/0
NODE_FROM=2
NODE_TO=1
META_DATA_LOCATION=internal

drbdadm -- --force dump-md $V > /tmp/md_orig.txt
sed -e "s/node-id $NODE_FROM/node-id $NODE_TO/" \
    -e "s/^peer.$NODE_FROM. /peer-NEW /" \
    -e "s/^peer.$NODE_TO. /peer[$NODE_FROM] /" \
    -e "s/^peer-NEW /peer[$NODE_TO] /" \
    < /tmp/md_orig.txt > /tmp/md.txt

drbdmeta --force $(drbdadm sh-minor $V) v09 $(drbdadm sh-md-dev $V) $META_DATA_LOCATION restore-md /tmp/md.txt
drbdmeta before 8.9.7 cannot cope with out-of-order peer sections. You will need to exchange the blocks manually by using an editor.
-
Bring up the resource on the remote node:
# drbdadm up <resource>
After the two peers connect, they will not initiate a full device synchronization. Instead, the automatic synchronization that now commences only covers those blocks that changed since the invocation of drbdadm --clear-bitmap new-current-uuid.
Even if there were no changes whatsoever since then, there may still be a brief synchronization period due to areas covered by the Activity Log being rolled back on the new Secondary. This may be mitigated by the use of checksum-based synchronization.
You may use this same procedure regardless of whether the resource is a regular DRBD resource, or a stacked resource. For stacked resources, simply add the -S or --stacked option to drbdadm.
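The node ID rewrite in the procedure above can be tried out on sample text before touching real metadata. The sketch below wraps the same four sed expressions in a function; the name swap_node_ids is hypothetical, and the sample input in the usage note only imitates the node-id and peer[N] lines of a metadata dump, which is an assumption about the dump layout.

```shell
#!/bin/sh
# Hypothetical helper: rewrites a metadata dump read from stdin.
# $1 = node ID the data originates from, $2 = node ID of the new node.
swap_node_ids() {
    from=$1
    to=$2
    # 1) change our own node ID
    # 2) park the peer[from] section under a temporary name
    # 3) rename the peer[to] section to peer[from]
    # 4) rename the parked section to peer[to]
    sed -e "s/node-id $from/node-id $to/" \
        -e "s/^peer.$from. /peer-NEW /" \
        -e "s/^peer.$to. /peer[$from] /" \
        -e "s/^peer-NEW /peer[$to] /"
}
```

For example, printf 'node-id 2;\npeer[1] {\n}\npeer[2] {\n}\n' | swap_node_ids 2 1 yields node-id 1; followed by the peer[2] section and then the peer[1] section. The temporary peer-NEW name is what keeps the two renames from colliding.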
5.1.11. Example configuration for four nodes
Here is an example for a four-node cluster.
resource r0 {
  device minor 0;
  disk /dev/vg/r0;
  meta-disk internal;

  on store1 {
    address 10.1.10.1:7100;
    node-id 1;
  }
  on store2 {
    address 10.1.10.2:7100;
    node-id 2;
  }
  on store3 {
    address 10.1.10.3:7100;
    node-id 3;
  }
  on store4 {
    address 10.1.10.4:7100;
    node-id 4;
  }

  connection-mesh {
    hosts store1 store2 store3 store4;
  }
}
In case you want to see the connection-mesh configuration expanded, try drbdadm dump <resource> -v.
As another example, if the four nodes have enough interfaces to provide a complete mesh via direct links[4], you can specify the IP addresses of the interfaces:
resource r0 {
  ...

  # store1 has crossover links like 10.99.1x.y
  connection {
    host store1 address 10.99.12.1 port 7012;
    host store2 address 10.99.12.2 port 7021;
  }
  connection {
    host store1 address 10.99.13.1 port 7013;
    host store3 address 10.99.13.3 port 7031;
  }
  connection {
    host store1 address 10.99.14.1 port 7014;
    host store4 address 10.99.14.4 port 7041;
  }

  # store2 has crossover links like 10.99.2x.y
  connection {
    host store2 address 10.99.23.2 port 7023;
    host store3 address 10.99.23.3 port 7032;
  }
  connection {
    host store2 address 10.99.24.2 port 7024;
    host store4 address 10.99.24.4 port 7042;
  }

  # store3 has crossover links like 10.99.3x.y
  connection {
    host store3 address 10.99.34.3 port 7034;
    host store4 address 10.99.34.4 port 7043;
  }
}
Please note the numbering scheme used for the IP addresses and ports. Another resource could use the same IP addresses, but ports 71xy, the next one 72xy, and so on.
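The numbering scheme can be written down as two small functions. This is only a sketch of the convention used in the example above, under the assumption that node numbers are single digits (1 to 9) and that resources are counted from 0; the function names mesh_port and mesh_addr are hypothetical.

```shell
#!/bin/sh
# Port used by node X for its link to node Y, for resource index RES:
# 7000 + RES*100 + X*10 + Y, that is 70xy, 71xy, 72xy, ...
mesh_port() {
    echo $((7000 + $1 * 100 + $2 * 10 + $3))
}

# Address of node X on the crossover link it shares with node Y:
# 10.99.<lower node><higher node>.<X>
mesh_addr() {
    if [ "$1" -lt "$2" ]; then
        lo=$1; hi=$2
    else
        lo=$2; hi=$1
    fi
    echo "10.99.${lo}${hi}.$1"
}
```

For instance, mesh_port 0 2 3 gives 7023 and mesh_addr 3 2 gives 10.99.23.3, matching the store2/store3 connection in the example; mesh_port 1 2 3 gives 7123 for a second resource.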
5.2. Checking DRBD Status
5.2.1. Monitoring and Performing Actions on DRBD Resources in Real-time
One convenient way to work with and monitor DRBD is by using the DRBDmon utility. DRBDmon is included in the drbd-utils package. To run the utility, enter drbdmon on a node where the drbd-utils package is installed.
DRBDmon is CLI-based but works with the concept of displays, similar to windows, and supports keyboard and mouse navigation. Different displays in DRBDmon show different aspects of DRBD status and activity. For example, one display lists all the DRBD resources and their statuses on the current node. Another display lists peer connections and their statuses for a selected resource. There are other displays for other DRBD components.
Besides being able to get information about the status of DRBD resources, volumes, connections, and other DRBD components, you can also use DRBDmon to perform actions on them. DRBDmon has context-based help text within the utility to help you navigate and use it. DRBDmon is useful for new DRBD users who can benefit from getting status information or performing actions without having to enter CLI commands. The utility is also useful for experienced DRBD users who might be working with a cluster that has a large number of DRBD resources.
5.2.2. Retrieving Status Information Through the DRBD Process File
Monitoring DRBD status by using /proc/drbd is deprecated. We recommend that you switch to other means, like Retrieving Status Information Using the DRBD Administration Tool, or for even more convenient monitoring, Retrieving Status Information Using the DRBD Setup Command.
/proc/drbd is a virtual file displaying basic information about the DRBD module. It was used extensively up to DRBD 8.4, but couldn’t keep up with the amount of information provided by DRBD 9.
$ cat /proc/drbd
version: 9.0.0 (api:1/proto:86-110)
GIT-hash: XXX build by [email protected], 2011-10-12 09:07:35
The first line, prefixed with version:, shows the DRBD version used on your system. The second line contains information about this specific build.
5.2.3. Retrieving Status Information Using the DRBD Administration Tool
In its simplest invocation, we just ask for the status of a single resource.
# drbdadm status home
home role:Secondary
  disk:UpToDate
  nina role:Secondary
    disk:UpToDate
  nino role:Secondary
    disk:UpToDate
  nono connection:Connecting
This output says that the resource home is UpToDate and Secondary locally, on ‘nina’, and on ‘nino’. These three nodes therefore have the same data on their storage devices, and no node is currently using the device.
The node ‘nono’ is not connected; its state is reported as Connecting. See Connection States below for more details.
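To pick out unconnected peers such as ‘nono’ from a script, the status text can be filtered. This is a minimal sketch with a hypothetical function name, unconnected_peers; it assumes that a peer line carries a connection: field only when the peer is not fully connected, as in the example output above.

```shell
#!/bin/sh
# Hypothetical helper: reads "drbdadm status" text on stdin and prints
# "<peer> <connection state>" for every peer whose line carries a
# connection: field other than Connected.
unconnected_peers() {
    awk '
        $2 ~ /^connection:/ {
            split($2, kv, ":")
            if (kv[2] != "Connected") print $1, kv[2]
        }
    '
}
```

Run as drbdadm status home | unconnected_peers; for the example output above it would print nono Connecting.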
You can get more information by passing the --verbose and/or --statistics arguments to drbdsetup (lines broken for readability):
# drbdsetup status home --verbose --statistics
home node-id:1 role:Secondary suspended:no
    write-ordering:none
  volume:0 minor:0 disk:UpToDate
      size:1048412 read:0 written:1048412 al-writes:0 bm-writes:48 upper-pending:0
      lower-pending:0 al-suspended:no blocked:no
  nina local:ipv4:10.9.9.111:7001 peer:ipv4:10.9.9.103:7010 node-id:0
      connection:Connected role:Secondary
      congested:no
    volume:0 replication:Connected disk:UpToDate resync-suspended:no
        received:1048412 sent:0 out-of-sync:0 pending:0 unacked:0
  nino local:ipv4:10.9.9.111:7021 peer:ipv4:10.9.9.129:7012 node-id:2
      connection:Connected role:Secondary
      congested:no
    volume:0 replication:Connected disk:UpToDate resync-suspended:no
        received:0 sent:0 out-of-sync:0 pending:0 unacked:0
  nono local:ipv4:10.9.9.111:7013 peer:ipv4:10.9.9.138:7031 node-id:3
      connection:Connecting
Every few lines in this example form a block that is repeated for every node used in this resource, with small format exceptions for the local node – see below for more details.
The first line in each block shows the node-id (for the current resource; a host can have different node-ids in different resources). Furthermore the role (see Resource Roles) is shown.
The next important line begins with the volume specification; normally these are numbered starting at zero, but the configuration may specify other IDs as well. This line shows the connection state in the replication item (see Connection States for details) and the remote disk state in disk (see Disk States).
Then there’s a line for this volume giving a bit of statistics: data received, sent, out-of-sync, and so on. Please see Performance Indicators and Connection Information Data for more information.
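As an illustration of working with these key:value statistics, the following sketch adds up every out-of-sync value in the status text. The function name sum_out_of_sync is hypothetical, and the sketch assumes the fields are whitespace-separated key:value pairs as in the example output.

```shell
#!/bin/sh
# Hypothetical helper: reads "drbdsetup status --statistics" text on
# stdin and prints the sum of all out-of-sync values (in KiB).
sum_out_of_sync() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                # only plain key:value fields split into exactly two parts
                if (split($i, kv, ":") == 2 && kv[1] == "out-of-sync")
                    total += kv[2]
            }
        }
        END { print total + 0 }
    '
}
```

A typical call would be drbdsetup status home --statistics | sum_out_of_sync; a result of 0 means that no peer has blocks marked out of sync.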
For the local node the first line shows the resource name, home, in our example. As the first block always describes the local node, there is no Connection or address information.
Please see the drbd.conf manual page for more information.
The other four lines in this example form a block that is repeated for every DRBD device configured, prefixed by the device minor number. In this case, this is 0, corresponding to the device /dev/drbd0.
The resource-specific output contains various pieces of information about the resource:
5.2.4. Retrieving Status Information Using the DRBD Setup Command
This is available only with userspace versions 8.9.3 and up.
Using the command drbdsetup events2 with additional options and arguments is a low-level mechanism to get information out of DRBD, suitable for use in automated tools, like monitoring.
One-shot Monitoring
In its simplest invocation, showing only the current status, the output looks like this (but, when running on a terminal, will include colors):
# drbdsetup events2 --now r0
exists resource name:r0 role:Secondary suspended:no
exists connection name:r0 peer-node-id:1 conn-name:remote-host connection:Connected role:Secondary
exists device name:r0 volume:0 minor:7 disk:UpToDate
exists device name:r0 volume:1 minor:8 disk:UpToDate
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:0
    replication:Established peer-disk:UpToDate resync-suspended:no
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:1
    replication:Established peer-disk:UpToDate resync-suspended:no
exists -
Real-time Monitoring
Without the --now option, the process will keep running, and send continuous updates like this:
# drbdsetup events2 r0
...
change connection name:r0 peer-node-id:1 conn-name:remote-host connection:StandAlone
change connection name:r0 peer-node-id:1 conn-name:remote-host connection:Unconnected
change connection name:r0 peer-node-id:1 conn-name:remote-host connection:Connecting
Then, for monitoring purposes, there’s another argument, --statistics, that will produce some performance counters and other facts:
‘drbdsetup’ verbose output (lines broken for readability):
# drbdsetup events2 --statistics --now r0
exists resource name:r0 role:Secondary suspended:no write-ordering:drain
exists connection name:r0 peer-node-id:1 conn-name:remote-host connection:Connected
    role:Secondary congested:no
exists device name:r0 volume:0 minor:7 disk:UpToDate size:6291228 read:6397188
    written:131844 al-writes:34 bm-writes:0 upper-pending:0 lower-pending:0
    al-suspended:no blocked:no
exists device name:r0 volume:1 minor:8 disk:UpToDate size:104854364
    read:5910680 written:6634548 al-writes:417 bm-writes:0 upper-pending:0
    lower-pending:0 al-suspended:no blocked:no
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:0
    replication:Established peer-disk:UpToDate resync-suspended:no received:0
    sent:131844 out-of-sync:0 pending:0 unacked:0
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:1
    replication:Established peer-disk:UpToDate resync-suspended:no received:0
    sent:6634548 out-of-sync:0 pending:0 unacked:0
exists -
You might also like the --timestamp parameter.
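Because each events2 line is an action, an object type, and a list of key:value fields, it is straightforward to filter. The sketch below reduces connection events to three columns; the function name connection_events is hypothetical, and it assumes each event arrives on a single line, unlike the examples above that are broken for readability.

```shell
#!/bin/sh
# Hypothetical filter: reads "drbdsetup events2" lines on stdin and
# prints "<action> <conn-name> <connection state>" for connection events.
connection_events() {
    awk '
        $2 == "connection" {
            name = ""; state = ""
            for (i = 3; i <= NF; i++) {
                n = index($i, ":")          # split at the first colon only
                key = substr($i, 1, n - 1)
                val = substr($i, n + 1)
                if (key == "conn-name")  name = val
                if (key == "connection") state = val
            }
            print $1, name, state
        }
    '
}
```

Used as drbdsetup events2 r0 | connection_events, this would turn the change events shown above into lines such as change remote-host StandAlone.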
5.2.5. Connection States
A resource’s connection state can be observed by issuing the drbdadm cstate command:
# drbdadm cstate <resource>
Connected
Connected
StandAlone
If you are interested in only a single connection of a resource, specify the connection name, too:
The default is the peer’s hostname as given in the configuration file.
# drbdadm cstate <resource>:<peer>
Connected
A resource may have one of the following connection states:
StandAlone: No network configuration available. The resource has not yet been connected, or has been administratively disconnected (using drbdadm disconnect), or has dropped its connection due to failed authentication or split brain.
Disconnecting: Temporary state during disconnection. The next state is StandAlone.
Unconnected: Temporary state, prior to a connection attempt. Possible next states: Connecting.
Timeout: Temporary state following a timeout in the communication with the peer. Next state: Unconnected.
BrokenPipe: Temporary state after the connection to the peer was lost. Next state: Unconnected.
NetworkFailure: Temporary state after the connection to the partner was lost. Next state: Unconnected.
ProtocolError: Temporary state after the connection to the partner was lost. Next state: Unconnected.
TearDown: Temporary state. The peer is closing the connection. Next state: Unconnected.
Connecting: This node is waiting until the peer node becomes visible on the network.
Connected: A DRBD connection has been established, data mirroring is now active. This is the normal state.
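A monitoring script usually needs only a coarse view of these states. The following sketch maps a connection state to one of four classes; the function name cstate_class and the class names are hypothetical choices, and every state not named explicitly is treated as one of the temporary states described above.

```shell
#!/bin/sh
# Hypothetical helper: classify a connection state string as printed
# by "drbdadm cstate".
cstate_class() {
    case "$1" in
        Connected)  echo ok ;;         # normal state
        Connecting) echo waiting ;;    # waiting for the peer to appear
        StandAlone) echo down ;;       # stays here until reconnected
        *)          echo transient ;;  # temporary states
    esac
}
```

For example, cstate_class "$(drbdadm cstate r0:bob)" would print ok for a healthy connection (r0 and bob are placeholder names).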
5.2.6. Replication States
Each volume has a replication state for each connection. The possible replication states are:
Off: The volume is not replicated over this connection, since the connection is not Connected.
Established: All writes to that volume are replicated online. This is the normal state.
StartingSyncS: Full synchronization, initiated by the administrator, is just starting. The next possible states are: SyncSource or PausedSyncS.
StartingSyncT: Full synchronization, initiated by the administrator, is just starting. Next state: WFSyncUUID.
WFBitMapS: Partial synchronization is just starting. Next possible states: SyncSource or PausedSyncS.
WFBitMapT: Partial synchronization is just starting. Next possible state: WFSyncUUID.
WFSyncUUID: Synchronization is about to begin. Next possible states: SyncTarget or PausedSyncT.
SyncSource: Synchronization is currently running, with the local node being the source of synchronization.
SyncTarget: Synchronization is currently running, with the local node being the target of synchronization.
PausedSyncS: The local node is the source of an ongoing synchronization, but synchronization is currently paused. This may be due to a dependency on the completion of another synchronization process, or due to synchronization having been manually interrupted by drbdadm pause-sync.
PausedSyncT: The local node is the target of an ongoing synchronization, but synchronization is currently paused. This may be due to a dependency on the completion of another synchronization process, or due to synchronization having been manually interrupted by drbdadm pause-sync.
VerifyS: On-line device verification is currently running, with the local node being the source of verification.
VerifyT: On-line device verification is currently running, with the local node being the target of verification.
Ahead: Data replication was suspended, since the link cannot cope with the load. This state is enabled by the configuration on-congestion option (see Configuring Congestion Policies and Suspended Replication).
Behind: Data replication was suspended by the peer, since the link cannot cope with the load. This state is enabled by the configuration on-congestion option on the peer node (see Configuring Congestion Policies and Suspended Replication).
5.2.7. Resource Roles
A resource’s role can be observed by issuing the drbdadm role command:
# drbdadm role <resource>
Primary
You may see one of the following resource roles:
Primary: The resource is currently in the primary role, and may be read from and written to. This role only occurs on one of the two nodes, unless dual-primary mode is enabled.
Secondary: The resource is currently in the secondary role. It normally receives updates from its peer (unless running in disconnected mode), but may neither be read from nor written to. This role may occur on one or both nodes.
Unknown: The resource’s role is currently unknown. The local resource role never has this status. It is only displayed for the peer’s resource role, and only in disconnected mode.
5.2.8. Disk States
A resource’s disk state can be observed by issuing the drbdadm dstate command:
# drbdadm dstate <resource>
UpToDate
The disk state may be one of the following:
Diskless: No local block device has been assigned to the DRBD driver. This may mean that the resource has never attached to its backing device, that it has been manually detached using drbdadm detach, or that it automatically detached after a lower-level I/O error.
Attaching: Transient state while reading metadata.
Detaching: Transient state while detaching and waiting for ongoing I/O operations to complete.
Failed: Transient state following an I/O failure report by the local block device. Next state: Diskless.
Negotiating: Transient state when an Attach is carried out on an already-Connected DRBD device.
Inconsistent: The data is inconsistent. This status occurs immediately upon creation of a new resource, on both nodes (before the initial full sync). Also, this status is found in one node (the synchronization target) during synchronization.
Outdated: Resource data is consistent, but outdated.
DUnknown: This state is used for the peer disk if no network connection is available.
Consistent: Consistent data of a node without connection. When the connection is established, it is decided whether the data is UpToDate or Outdated.
UpToDate: Consistent, up-to-date state of the data. This is the normal state.
5.2.9. Connection Information Data
local: Shows the network family, the local address and port that is used to accept connections from the peer.
peer: Shows the network family, the peer address and port that is used to connect.
congested: This flag tells whether the TCP send buffer of the data connection is more than 80% filled.
5.2.10. Performance Indicators
The command drbdsetup status --verbose --statistics can be used to show performance statistics. These are also available in drbdsetup events2 --statistics, although there will not be a changed event for every change. The statistics include the following counters and gauges:
Per volume/device:
read: Net data read from local disk; in KiB.
written: Net data written to local disk; in KiB.
al-writes: Number of updates of the activity log area of the metadata.
bm-writes: Number of updates of the bitmap area of the metadata.
upper-pending: Number of block I/O requests forwarded to DRBD, but not yet answered (completed) by DRBD.
lower-pending: Number of open requests to the local I/O sub-system issued by DRBD.
blocked: Shows local I/O congestion.
-
no: No congestion.
-
upper: I/O above the DRBD device is blocked, that is, to the filesystem. Typical causes are
-
I/O suspension by the administrator, see the suspend-io command in drbdadm.
-
Transient blocks, for example, during attach/detach.
-
Buffers depleted, see Optimizing DRBD Performance.
-
Waiting for bitmap I/O.
-
-
lower: Backing device is congested.
-
upper,lower: Both upper and lower are blocked.
Per connection:
ap-in-flight: Application data that is being written by the peer. That is, DRBD has sent it to the peer and is waiting for the acknowledgement that it has been written. In sectors (512 bytes).
rs-in-flight: Resync data that is being written by the peer. That is, DRBD is SyncSource, has sent data to the peer as part of a resync and is waiting for the acknowledgement that it has been written. In sectors (512 bytes).
Per connection and volume (“peer device”):
done: Percentage of data synchronized out of the amount to be synchronized.
resync-suspended: Whether the resynchronization is currently suspended or not. Possible values are no, user, peer, dependency. Comma separated.
received: Net data received from the peer; in KiB.
sent: Net data sent to the peer; in KiB.
out-of-sync: Amount of data currently out of sync with this peer, according to the bitmap that DRBD has for it; in KiB.
pending: Number of requests sent to the peer, but that have not yet been acknowledged by the peer.
unacked: Number of requests received from the peer, but that have not yet been acknowledged by DRBD on this node.
dbdt1: Rate of synchronization within the last few seconds, reported as MiB/seconds. You can affect the synchronization rate by configuring options that are detailed in the Configuring the Rate of Synchronization section of this user’s guide.
eta: Number of seconds remaining for the synchronization to complete. This number is calculated based on the synchronization rate within the last few seconds and the size of the resource’s backing device that remains to be synchronized.
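The relationship between the remaining data, the synchronization rate, and the remaining time can be illustrated with a small calculation. This is only a back-of-the-envelope sketch, not DRBD's own algorithm: the function name sync_eta_seconds is hypothetical, and it simply divides the out-of-sync amount (KiB, converted to MiB) by a rate in MiB/s.

```shell
#!/bin/sh
# Hypothetical helper: estimate remaining sync time in whole seconds.
# $1 = data still out of sync, in KiB; $2 = sync rate, in MiB/s.
sync_eta_seconds() {
    awk -v oos_kib="$1" -v rate_mib="$2" 'BEGIN {
        if (rate_mib <= 0) { print "unknown"; exit }
        # KiB -> MiB, then divide by MiB/s; truncate to whole seconds
        printf "%d\n", (oos_kib / 1024) / rate_mib
    }'
}
```

For example, 204800 KiB (200 MiB) out of sync at 50 MiB/s gives sync_eta_seconds 204800 50, which prints 4.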
5.3. Enabling and Disabling Resources
5.3.1. Enabling Resources
Normally, all configured DRBD resources are automatically enabled
-
by a cluster resource management application at its discretion, based on your cluster configuration, or
-
by the systemd units (e.g.,
[email protected]
)
If, however, you need to enable resources manually for any reason, you may do so by issuing the command
# drbdadm up <resource>
As always, you may use the keyword all instead of a specific resource name if you want to enable all resources configured in /etc/drbd.conf at once.
5.4. Reconfiguring Resources
DRBD allows you to reconfigure resources while they are operational. To that end,
-
make any necessary changes to the resource configuration in /etc/drbd.conf,
-
synchronize your /etc/drbd.conf file between both nodes,
-
issue the drbdadm adjust <resource> command on both nodes.
drbdadm adjust then hands off to drbdsetup to make the necessary adjustments to the configuration. As always, you are able to review the pending drbdsetup invocations by running drbdadm with the -d (dry-run) option.
When making changes to the common section in /etc/drbd.conf, you can adjust the configuration for all resources in one run, by issuing drbdadm adjust all.
5.5. Promoting and Demoting Resources
Manually switching a resource’s role from secondary to primary (promotion) or vice versa (demotion) is done using one of the following commands:
# drbdadm primary <resource>
# drbdadm secondary <resource>
In single-primary mode (DRBD’s default), any resource can be in the primary role on only one node at any given time while the connection state is Connected. Therefore, issuing drbdadm primary <resource> on one node while the specified resource is still in the primary role on another node will result in an error.
A resource configured to allow dual-primary mode can be switched to the primary role on two nodes; this is, for example, needed for online migration of virtual machines.
5.6. Basic Manual Failover
If not using a cluster manager and looking to handle failovers manually in a passive/active configuration, the process is as follows.
On the current primary node, stop any applications or services using the DRBD device, unmount the DRBD device, and demote the resource to secondary.
# umount /dev/drbd/by-res/<resource>/<vol-nr>
# drbdadm secondary <resource>
Now, on the node that you want to make primary, promote the resource and mount the device.
# drbdadm primary <resource>
# mount /dev/drbd/by-res/<resource>/<vol-nr> <mountpoint>
If you’re using the auto-promote feature, you don’t need to change the roles (Primary/Secondary) manually; you only need to stop the services and unmount the device on one node, and mount it on the other.
5.7. Shutting Down Gracefully By Using a systemd Service
Included with drbd-utils versions since 9.26.0, there is a “graceful shutdown” service, drbd-graceful-shutdown.service. This service ensures that the DRBD “last man standing” behavior applies to your cluster when shutting down nodes.
The graceful shutdown service is started automatically by the udev service at the moment the first DRBD device is created. At system shutdown, the graceful shutdown service ensures that all nodes hosting DRBD resources shut down services in a proper sequence so that the last node keeps quorum.
During a normal system shutdown sequence without this intervention, networking often stops before a system unmounts file systems and takes down DRBD devices. Without the graceful shutdown service, to make it so that the last node to shut down keeps DRBD quorum, you would need to manually unmount file systems and down DRBD resources before shutting down nodes.
By virtue of the graceful shutdown service running on a node, when shutting down, a node marks its DRBD resources outdated, stops its running DRBD services, and then networking services are allowed to stop. This shutdown sequence makes it so that the DRBD peer nodes leaving the cluster are able to communicate their outdated status over the network to the last node, and in this way the last node to leave the cluster will keep quorum.
5.8. Upgrading DRBD
Upgrading DRBD is a fairly simple process. This section contains warnings or important information regarding upgrading to a particular DRBD 9 version from another DRBD 9 version.
If you are upgrading DRBD from 8.4.x to 9.x, refer to the instructions within the Appendix.
5.8.1. Upgrading to DRBD 9.2.x
If you are upgrading to DRBD 9.2.x from an earlier version not on the 9.2 branch, you will need to pay attention to the names of your resources. DRBD 9.2.x enforces strict naming conventions for DRBD resources. By default, DRBD 9.2.x accepts only alphanumeric, ., +, _, and - characters in resource names (regular expression: [0-9A-Za-z.+_-]*). If you depend on the old behavior, it can be brought back by disabling strict name checking:
# echo 0 > /sys/module/drbd/parameters/strict_names
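Before upgrading, you may want to check existing resource names against the strict rule. The following is a sketch only: the function name is_strict_drbd_name is hypothetical, the underscore is included in the character class as an assumption that you should verify against your DRBD version, and the sketch rejects empty names as a deliberate choice.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the name consists only of
# alphanumerics, ".", "+", "_" and "-". Character ranges in shell
# patterns depend on the locale; running with LC_ALL=C is safest.
is_strict_drbd_name() {
    [ -n "$1" ] || return 1
    case "$1" in
        *[!0-9A-Za-z.+_-]*) return 1 ;;   # contains a disallowed character
        *)                  return 0 ;;
    esac
}
```

For example, is_strict_drbd_name "web.db+1" succeeds, while is_strict_drbd_name "res/0" and is_strict_drbd_name "bad name" fail.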
5.8.2. Upgrading from DRBD 9.0.x
Due to an issue in the wire protocol code in DRBD versions on the 9.1 and 9.2 branches, you will not be able to upgrade DRBD 9.0 to 9.1 or 9.2 unless you are on version 9.0.26 or higher. If you are on an earlier 9.0 version, first upgrade to the latest DRBD 9.0 bug fix version, then continue upgrading to 9.1 or 9.2.
Even if you are on version 9.0.26 or higher, it still might be safest to first upgrade to the latest 9.0 version, before upgrading to a higher minor version. The latest 9.0 version is periodically exercised in continuous integration testing and therefore will be the safest 9.0 version to upgrade from.
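The version rule above can be expressed as a small pre-flight check. This is a sketch under the assumption that the installed version is available as a plain x.y.z string; the function name can_upgrade_directly is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: fails for 9.0.x versions older than 9.0.26,
# which must first be upgraded to the latest 9.0 release.
# $1 = currently installed DRBD version, for example 9.0.25
can_upgrade_directly() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    patch=${rest#*.}
    patch=${patch%%[!0-9]*}    # drop any suffix after the patch number
    if [ "$major" -eq 9 ] && [ "$minor" -eq 0 ] && [ "${patch:-0}" -lt 26 ]; then
        return 1
    fi
    return 0
}
```

For example, can_upgrade_directly 9.0.25 fails, while can_upgrade_directly 9.0.26 and can_upgrade_directly 9.1.5 succeed.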
5.8.3. Compatibility
DRBD is wire protocol compatible over minor versions, with the exception of DRBD 9.0 versions older than 9.0.26. The DRBD wire protocol is independent of the host kernel version and the machines’ CPU architectures.
DRBD is protocol compatible within a major number. For example, all version 9.x.y releases are protocol compatible.
5.8.4. Upgrading Within DRBD 9
If you are already running DRBD 9.x, you can upgrade to a newer DRBD 9 version by following these steps:
-
Verify that DRBD resources are synchronized, by checking the DRBD state.
-
Stop the DRBD service or, if you are using a cluster manager, put the cluster node that you are upgrading into standby.
-
Unload and then reload the new kernel module.
-
Start the DRBD resources and bring the cluster node online again if you are using a cluster manager.
These individual steps are detailed below.
Checking the DRBD State
Before you update DRBD, verify that your resources are synchronized. The output of drbdadm status all should show an UpToDate status for your resources, as shown for an example resource (data) below:
# drbdadm status all
data role:Secondary
  disk:UpToDate
  node-1 role:Primary
    peer-disk:UpToDate
Upgrading the Packages
If you are ready to upgrade DRBD within version 9, first upgrade your packages.
RPM-based:
# dnf -y upgrade
DEB-based:
# apt update && apt -y upgrade
Once the upgrade is finished you will have the latest DRBD 9.x kernel module and drbd-utils installed. However, the new kernel module is not active yet. Before you make the new kernel module active, you must first pause your cluster services.
Pausing the Services
You can pause your cluster services manually or according to your cluster manager’s documentation. Both processes are covered below. If you are running Pacemaker as your cluster manager, do not use the manual method.
Loading the New Kernel Module
After pausing your cluster services, the DRBD module should not be in use anymore, so unload it by entering the following command:
# rmmod drbd_transport_tcp; rmmod drbd
If there is a message like ERROR: Module drbd is in use, then not all resources have been correctly stopped. Retry upgrading the packages, or run the command drbdadm down all to find out which resources are still active.
Some typical issues that might prevent you from unloading the kernel module are:
-
NFS export on a DRBD-backed filesystem (see exportfs -v output)
-
File system still mounted (check grep drbd /proc/mounts)
-
Loopback device active (losetup -l)
-
Device mapper using DRBD, directly or indirectly (dmsetup ls --tree)
-
LVM with a DRBD-PV (pvs)
This list is not complete. These are just the most common examples.
Now you can load the new DRBD module.
# modprobe drbd
Next, you can verify that the version of the DRBD kernel module that is loaded is the updated 9.x.y version. The output of drbdadm --version should show the 9.x.y version that you are expecting to upgrade to and look similar to this:
DRBDADM_BUILDTAG=GIT-hash: [...] build\ by\ buildd@lcy02-amd64-080\,\ 2023-03-14\ 10:21:20
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090202
DRBD_KERNEL_VERSION=9.2.2
DRBDADM_VERSION_CODE=0x091701
DRBDADM_VERSION=9.23.1
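The *_VERSION_CODE values in this output appear to pack the version as one byte each for major, minor, and patch level, which is consistent with the pairs shown (0x090202 and 9.2.2, 0x091701 and 9.23.1). Assuming that encoding, a code can be decoded as follows; the function name decode_version_code is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: decode a hexadecimal version code such as
# 0x090202 into dotted form, assuming one byte per component.
decode_version_code() {
    code=$(($1))
    echo "$(( (code >> 16) & 255 )).$(( (code >> 8) & 255 )).$(( code & 255 ))"
}
```

With the values from the example output, decode_version_code 0x090202 prints 9.2.2 and decode_version_code 0x091701 prints 9.23.1.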
Starting the DRBD Resources Again
Now, the only thing left to do is to get the DRBD devices up and running again. You can do this by using the drbdadm up all command.
Next, depending on whether you are using a cluster manager or if you are managing your DRBD resources manually, there are two different ways to bring up your resources. If you are using a cluster manager, follow its documentation.
-
Manually
# systemctl start drbd@<resource>.target
-
Pacemaker
# crm node online node-2
This should make DRBD connect to the other node, and the resynchronization process will start.
When the two nodes are UpToDate on all resources again, you can move your applications to the already upgraded node, and then follow the same steps on the next cluster node that you want to upgrade.
5.9. Enabling dual-primary mode
Dual-primary mode allows a resource to assume the primary role simultaneously on more than one node. Doing so is possible on either a permanent or a temporary basis.
Dual-primary mode requires that the resource is configured to replicate synchronously (protocol C). Because of this it is latency sensitive, and ill-suited for WAN environments. Additionally, as both resources are always primary, any interruption in the network between nodes will result in a split-brain.
In DRBD 9.0.x, dual-primary mode is limited to exactly two Primaries, for use in live migration.
5.9.1. Permanent dual-primary mode
To enable dual-primary mode, set the allow-two-primaries option to yes in the net section of your resource configuration:
resource <resource> {
net {
protocol C;
allow-two-primaries yes;
fencing resource-and-stonith;
}
handlers {
fence-peer "...";
unfence-peer "...";
}
...
}
After that, do not forget to synchronize the configuration between nodes. Run drbdadm adjust <resource> on both nodes.
You can now change both nodes to role primary at the same time with drbdadm primary <resource>.
You should always implement suitable fencing policies. Using allow-two-primaries without fencing is a very bad idea, even worse than using single-primary without fencing.
5.9.2. Temporary dual-primary mode
To temporarily enable dual-primary mode for a resource normally running in a single-primary configuration, issue the following command:
# drbdadm net-options --protocol=C --allow-two-primaries <resource>
To end temporary dual-primary mode, run the same command as above but with --allow-two-primaries=no (and your desired replication protocol, if applicable).
5.10. Using Online Device Verification
5.10.1. Enabling Online Verification
Online device verification for resources is not enabled by default. To enable it, add the following lines to your resource configuration in /etc/drbd.conf:
resource <resource> {
net {
verify-alg <algorithm>;
}
...
}
<algorithm> may be any message digest algorithm supported by the kernel crypto API in your system’s kernel configuration. Normally, you should be able to choose at least from sha1, md5, and crc32c.
If you make this change to an existing resource, as always, synchronize your drbd.conf to the peer, and run drbdadm adjust <resource> on both nodes.
5.10.2. Invoking Online Verification
After you have enabled online verification, you will be able to initiate a verification run using the following command:
# drbdadm verify <resource>:<peer>/<volume>
When you do so, DRBD starts an online verification run for <volume> to <peer> in <resource>, and if it detects any blocks that are not in sync, will mark those blocks as such and write a message to the kernel log. Any applications using the device at that time can continue to do so unimpeded, and you may also switch resource roles at will.
<volume> is optional; if it is omitted, DRBD verifies all volumes in that resource.
If out-of-sync blocks were detected during the verification run, you may resynchronize them using the following commands after verification has completed.
Since drbd-9.0.29 the preferred way is one of these commands:
# drbdadm invalidate <resource>:<peer>/<volume> --reset-bitmap=no
# drbdadm invalidate-remote <resource>:<peer>/<volume> --reset-bitmap=no
The first command will cause the local differences to be overwritten by the remote version. The second command does it in the opposite direction.
Before drbd-9.0.29, you need to initiate a resync. One way to do that is to disconnect from a primary and ensure that the primary changes at least one block while the peer is away.
# drbdadm disconnect <resource>:<peer>
## write one block on the primary
# drbdadm connect <resource>:<peer>
5.10.3. Automating Online Verification
Most users will want to automate online device verification. This can be easily accomplished. Create a file with the following contents, named /etc/cron.d/drbd-verify on one of your nodes:
42 0 * * 0 root /sbin/drbdadm verify <resource>
This will have cron invoke a device verification every Sunday at 42 minutes past midnight; so, if you come into the office on Monday morning, a quick examination of the resource’s status would show the result. If your devices are very big, and the ~32 hours were not enough, then you’ll notice VerifyS or VerifyT as connection state, meaning that the verify is still in progress.
If you have enabled online verification for all your resources (for example, by adding verify-alg <algorithm> to the common section in /etc/drbd.d/global_common.conf), you may also use:
42 0 * * 0 root /sbin/drbdadm verify all
5.11. Configuring the Rate of Synchronization
Normally, one tries to ensure that background synchronization (which makes the data on the synchronization target temporarily inconsistent) completes as quickly as possible. However, it is also necessary to keep background synchronization from hogging all bandwidth otherwise available for foreground replication, which would be detrimental to application performance. Therefore, you must configure the synchronization bandwidth to match your hardware — which you may do in a permanent fashion or on-the-fly.
It does not make sense to set a synchronization rate that is higher than the maximum write throughput on your secondary node. You must not expect your secondary node to miraculously be able to write faster than its I/O subsystem allows, just because it happens to be the target of an ongoing device synchronization.
Likewise, and for the same reasons, it does not make sense to set a synchronization rate that is higher than the bandwidth available on the replication network.
5.11.1. Estimating a Synchronization Speed
A good rule for this value is to use about 30% of the available replication bandwidth. Therefore, if you had an I/O subsystem capable of sustaining write throughput of 400MB/s, and a Gigabit Ethernet network capable of sustaining 110 MB/s network throughput (the network being the bottleneck), you would calculate:
110 MB/s × 0.3 = 33 MB/s
Therefore, the recommended value for the rate option would be 33M.
By contrast, if you had an I/O subsystem with a maximum throughput of 80MB/s and a Gigabit Ethernet connection (the I/O subsystem being the bottleneck), you would calculate:
80 MB/s × 0.3 = 24 MB/s
In this case, the recommended value for the rate option would be 24M.
Similarly, for a storage speed of 800MB/s and a 10GbE network connection, you would shoot for a synchronization rate of about 240MB/s.
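The rule of thumb above (30% of whichever is slower, storage or network) can be written as a small shell function. This is a minimal sketch with a hypothetical function name. The 1100 MB/s figure for 10GbE is an assumption for illustration only, because the text does not state a sustained throughput for that link.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a resync rate in MB/s.
# Arguments: <storage write throughput in MB/s> <network throughput in MB/s>
suggest_sync_rate() {
    local storage=$1 network=$2
    # The slower of the two is the bottleneck.
    local bottleneck=$(( storage < network ? storage : network ))
    # Use about 30% of the bottleneck (integer arithmetic).
    echo $(( bottleneck * 30 / 100 ))
}

suggest_sync_rate 400 110    # prints 33 (network is the bottleneck)
suggest_sync_rate 80 110     # prints 24 (storage is the bottleneck)
suggest_sync_rate 800 1100   # prints 240 (1100 MB/s for 10GbE is assumed)
```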
5.11.2. Variable Synchronization Rate Configuration
When multiple DRBD resources share a single replication/synchronization network, synchronization with a fixed rate may not be an optimal approach. For this reason, variable-rate synchronization has been enabled by default since DRBD 8.4.0. In this mode, DRBD uses an automated control loop algorithm to determine, and adjust, the synchronization rate. This algorithm ensures that there is always sufficient bandwidth available for foreground replication, greatly mitigating the impact that background synchronization has on foreground I/O.
The optimal configuration for variable-rate synchronization may vary greatly depending on the available network bandwidth, application I/O pattern and link congestion. Ideal configuration settings also depend on whether DRBD Proxy is in use or not. It may be wise to engage professional consultancy to optimally configure this DRBD feature. An example configuration (which assumes a deployment in conjunction with DRBD Proxy) is provided below:
resource <resource> {
disk {
c-plan-ahead 5;
c-max-rate 10M;
c-fill-target 2M;
}
}
A good starting value for c-fill-target is BDP * 2, where BDP is your bandwidth-delay-product on the replication link.
For example, when using a 1GBit/s crossover connection, you’ll end up with about 200µs latency[5].
1GBit/s means about 120MB/s; times 200 × 10^-6 seconds gives 24000 bytes. Just round that value up to the next MB, and you’re good to go.
Another example: a 100MBit WAN connection with 200ms latency means 12MB/s times 0.2s, or about 2.4MB “on the wire”. Here a good starting value for c-fill-target would be 3MB.
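The bandwidth-delay product from the two examples above can be reproduced with integer shell arithmetic. This is a minimal sketch with hypothetical function names. It computes only the BDP and the rounding step described in the text; choosing the final c-fill-target from these numbers remains a judgment call.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the bandwidth-delay product (BDP).

# BDP in bytes. Arguments: <bandwidth in MB/s> <latency in microseconds>.
# MB/s times microseconds yields bytes, because 10^6 and 10^-6 cancel out.
bdp_bytes() {
    echo $(( $1 * $2 ))
}

# Round a byte count up to the next whole MB (10^6 bytes).
round_up_mb() {
    echo $(( ($1 + 999999) / 1000000 ))
}

bdp_bytes 120 200                      # prints 24000 (1GBit/s, 200µs)
round_up_mb "$(bdp_bytes 120 200)"     # prints 1
bdp_bytes 12 200000                    # prints 2400000 (100MBit, 200ms)
round_up_mb "$(bdp_bytes 12 200000)"   # prints 3
```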
Please see the drbd.conf manual page for more details on the other configuration items.
5.11.3. Permanent Fixed Synchronization Rate Configuration
In a few, very restricted situations[6], it might make sense to just use some fixed synchronization rate. In this case, first of all you need to turn the dynamic sync rate controller off, by using c-plan-ahead 0;.
Then, the maximum bandwidth a resource uses for background re-synchronization is determined by the resync-rate option for a resource. This must be included in the resource’s disk section in /etc/drbd.conf:
resource <resource> {
disk {
resync-rate 40M;
...
}
...
}
Note that the rate setting is given in bytes, not bits per second; the default unit is Kibibyte, so a value of 4096 would be interpreted as 4MiB/s.
This just defines a rate that DRBD tries to achieve. If there is a bottleneck with lower throughput (network, storage speed), the defined speed (aka the “wished-for” performance) won’t be reached.
5.11.4. Further Synchronization Hints
When some amount of the to-be-synchronized data isn’t really in use anymore (for example, because files got deleted while one node wasn’t connected), you might benefit from the Trim and Discard Support.
Furthermore, c-min-rate is easy to misunderstand: it doesn’t define a minimum synchronization speed, but rather a limit below which DRBD will not slow down further on purpose. Whether you manage to reach that synchronization rate depends on your network and storage speed, network latency (which might be highly variable for shared links), and application I/O (which you might not be able to do anything about).
5.12. Configuring Checksum-based Synchronization
Checksum-based synchronization is not enabled for resources by default. To enable it, add the following lines to your resource configuration in /etc/drbd.conf:
resource <resource> {
net {
csums-alg <algorithm>;
}
...
}
<algorithm> may be any message digest algorithm supported by the kernel crypto API in your system’s kernel configuration. Normally, you should be able to choose at least from sha1, md5, and crc32c.
If you make this change to an existing resource, as always, synchronize your drbd.conf to the peer, and run drbdadm adjust <resource> on both nodes.
5.13. Configuring Congestion Policies and Suspended Replication
In an environment where the replication bandwidth is highly variable (as would be typical in WAN replication setups), the replication link may occasionally become congested. In a default configuration, this would cause I/O on the primary node to block, which is sometimes undesirable.
Instead, you may configure DRBD to suspend the ongoing replication in this case, causing the Primary’s data set to pull ahead of the Secondary. In this mode, DRBD keeps the replication channel open — it never switches to disconnected mode — but does not actually replicate until sufficient bandwidth becomes available again.
The following example is for a DRBD Proxy configuration:
resource <resource> {
net {
on-congestion pull-ahead;
congestion-fill 2G;
congestion-extents 2000;
...
}
...
}
It is usually wise to set both congestion-fill and congestion-extents together with the pull-ahead option.
A good value for congestion-fill is 90%
-
of the allocated DRBD proxy buffer memory, when replicating over DRBD Proxy, or
-
of the TCP network send buffer, in non-DRBD Proxy setups.
A good value for congestion-extents is 90% of your configured al-extents for the affected resources.
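The two 90% values can be derived as follows. This is a minimal sketch: the function name is hypothetical, and the input values (an al-extents setting of 1237 and a 10 MiB TCP send buffer) are assumptions chosen only for illustration. Substitute the values from your own configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: integer percentage. Usage: percent_of <percent> <value>
percent_of() {
    echo $(( $2 * $1 / 100 ))
}

# Assumed example inputs, not taken from the guide:
al_extents=1237                         # your configured al-extents
sndbuf_bytes=$(( 10 * 1024 * 1024 ))    # your TCP send buffer, in bytes

echo "congestion-extents candidate: $(percent_of 90 "$al_extents")"
echo "congestion-fill candidate (bytes): $(percent_of 90 "$sndbuf_bytes")"
```

With these assumed inputs, the candidates are 1113 extents and 9437184 bytes. Check the drbd.conf manual page for the units that congestion-fill expects before writing the value into your configuration.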
5.14. Configuring I/O Error Handling Strategies
DRBD’s strategy for handling lower-level I/O errors is determined by the on-io-error option, included in the resource disk configuration in /etc/drbd.conf:
resource <resource> {
disk {
on-io-error <strategy>;
...
}
...
}
You may, of course, set this in the common section too, if you want to define a global I/O error handling policy for all resources.
The on-io-error option is independent from the on-no-data-accessible option. Some on-io-error strategies involve retrying I/O requests on peer disks. The on-no-data-accessible option setting dictates DRBD behavior when I/O requests are unsuccessful on all disks.
You can set the on-io-error <strategy> to one of the following:
detach
This is the default and recommended option. On the occurrence of a lower-level I/O error, the node drops its backing device, and continues in diskless mode.
pass_on
This causes DRBD to change the disk status to Inconsistent, mark the failed block as inconsistent in the DRBD quick-sync bitmap, and retry the I/O operation on a peer (secondary node) disk. If the I/O operation succeeds on at least one peer, then a write operation is considered successful. DRBD will retry a read operation until there are no more peer disks to try to read from.
call-local-io-error
Invokes the command defined as the local I/O error handler. This requires that a corresponding local-io-error command invocation is defined in the resource’s handlers section. It is entirely left to the administrator’s discretion to implement I/O error handling using the command (or script) invoked by local-io-error.
Early DRBD versions (prior to 8.0) included another option, panic, which would forcibly remove the node from the cluster by way of a kernel panic, whenever a local I/O error occurred. While that option is no longer available, the same behavior may be mimicked through the local-io-error/call-local-io-error interface. You should do so only if you fully understand the implications of such behavior.
You may reconfigure a running resource’s I/O error handling strategy by following this process:
-
Edit the resource configuration in /etc/drbd.d/<resource>.res.
-
Copy the configuration to the peer node.
-
Issue drbdadm adjust <resource> on both nodes.
5.15. Configuring Replication Traffic Integrity Checking
Replication traffic integrity checking is not enabled for resources by default. To enable it, add the following lines to your resource configuration in /etc/drbd.conf:
resource <resource> {
net {
data-integrity-alg <algorithm>;
}
...
}
<algorithm> may be any message digest algorithm supported by the kernel crypto API in your system’s kernel configuration. Normally, you should be able to choose at least from sha1, md5, and crc32c.
If you make this change to an existing resource, as always, synchronize your drbd.conf to the peer, and run drbdadm adjust <resource> on both nodes.
This feature is not intended for production use. Enable only if you need to diagnose data corruption problems, and want to see whether the transport path (network hardware, drivers, switches) might be at fault!
5.16. Resizing Resources
When growing (extending) DRBD volumes, you need to grow from bottom to top. First, you need to extend the backing block devices on all nodes. Then you can tell DRBD to use the new space.
Once the DRBD volume is extended, you still need to propagate that change into whatever upper layers might be using the DRBD volume, for example, by extending the file system, or making a VM running with this volume attached aware of the new “disk size”.
Doing this typically means taking the following steps:
-
On all nodes, resize the backing logical volume, for example, when using LVM:
# lvextend -L +${additional_gb}g VG/LV
-
On one node, resize the DRBD volume:
# drbdadm resize ${resource_name}/${volume_number}
-
On the DRBD primary node only, resize the file system by using the tool specific to the file system. Refer to the details below and details in the Growing Online section.
Note that different file systems have different capabilities and different sets of management tools. For example, XFS can only grow. You point its tool to the active mount point: xfs_growfs /where/you/have/it/mounted.
The EXT family, by contrast, can both grow (even online) and shrink (only offline; you have to unmount it first). To resize an ext3 or ext4 file system, you point the tool not to the mount point, but to the (mounted) block device: resize2fs /dev/drbd#
Be sure to use the correct DRBD device (as displayed by mount or df -T, while mounted), and not the backing block device. If DRBD is up, running the tool against the backing device is not supposed to work anyway (resize2fs: Device or resource busy while trying to open /dev/mapper/VG-LV Couldn’t find valid filesystem superblock.).
If you tried to do that offline (with DRBD stopped), you may corrupt DRBD metadata if you ran the file system tools directly against the backing LV or partition. So don’t.
You do the file system resize only once on the Primary, against the active DRBD device. DRBD replicates the changes to the file system structure. That is what you have it for.
Also, don’t use resize2fs on XFS volumes, or XFS tools on EXT, and so on; use the right tool for the file system in use.
resize2fs: Bad magic number in super-block while trying to open /dev/drbd7 is probably just trying to tell you that this is not an EXT file system, and you should try another tool instead. Maybe xfs_growfs? But as mentioned, that does not take the block device, but the mount point as argument.
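The distinction between the two tools (block device for resize2fs, mount point for xfs_growfs) can be captured in a small dispatcher. This is a minimal sketch with a hypothetical function name. It only prints the command, so you can review it before running it on the Primary, and it covers only the file systems named above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the grow command for a file system on DRBD.
# Arguments: <fstype> <DRBD device> <mount point>
grow_fs_command() {
    local fstype=$1 device=$2 mountpoint=$3
    case "$fstype" in
        ext3|ext4)
            # resize2fs takes the (mounted) DRBD block device.
            echo "resize2fs $device" ;;
        xfs)
            # xfs_growfs takes the active mount point.
            echo "xfs_growfs $mountpoint" ;;
        *)
            echo "no grow command known for file system type: $fstype" >&2
            return 1 ;;
    esac
}

grow_fs_command ext4 /dev/drbd7 /mnt/data   # prints: resize2fs /dev/drbd7
grow_fs_command xfs /dev/drbd7 /mnt/data    # prints: xfs_growfs /mnt/data
```

The file system type for the first argument can be read from the df -T output mentioned above. The device and mount point in the example calls are placeholders.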
When shrinking (reducing) DRBD volumes, you need to shrink from top to bottom. So first verify that no one is using the space you want to cut off. Next, shrink the file system (if your file system supports that). Then tell DRBD to stop using that space, which is not so easy with DRBD internal metadata, because they are by design “at the end” of the backing device.
Once you are sure that DRBD won’t use the space anymore either, you can cut it off from the backing device, for example using lvreduce.
See also Shrinking Online, Shrinking Offline.
5.16.1. Growing Online
If the backing block devices can be grown while in operation (online), it is also possible to increase the size of a DRBD device based on these devices during operation. To do so, two criteria must be fulfilled:
-
The affected resource’s backing device must be one managed by a logical volume management subsystem, such as LVM.
-
The resource must currently be in the Connected connection state.
Having grown the backing block devices on all nodes, ensure that only one node is in primary state. Then enter on one node:
# drbdadm resize <resource>
This triggers a synchronization of the new section. The synchronization is done from the primary node to the secondary node.
If the space you’re adding is clean, you can skip syncing the additional space by using the --assume-clean option.
# drbdadm -- --assume-clean resize <resource>
5.16.2. Growing Offline
When the backing block devices on both nodes are grown while DRBD is inactive, and the DRBD resource is using external metadata, then the new size is recognized automatically. No administrative intervention is necessary. The DRBD device will have the new size after the next activation of DRBD on both nodes and a successful establishment of a network connection.
If however the DRBD resource is configured to use internal metadata, then this metadata must be moved to the end of the grown device before the new size becomes available. To do so, complete the following steps:
This is an advanced procedure. Use at your own discretion.
-
Unconfigure your DRBD resource:
# drbdadm down <resource>
-
Save the metadata in a text file prior to resizing:
# drbdadm dump-md <resource> > /tmp/metadata
You must do this on both nodes, using a separate dump file for every node. Do not dump the metadata on one node, and simply copy the dump file to the peer. This. will. not. work.
-
Grow the backing block device on both nodes.
-
Adjust the size information (la-size-sect) in the file /tmp/metadata accordingly, on both nodes. Remember that la-size-sect must be specified in sectors.
-
Re-initialize the metadata area:
# drbdadm create-md <resource>
-
Re-import the corrected metadata, on both nodes:
# drbdmeta_cmd=$(drbdadm -d dump-md <resource>)
# ${drbdmeta_cmd/dump-md/restore-md} /tmp/metadata
Valid meta-data in place, overwrite? [need to type 'yes' to confirm] yes
Successfully restored meta data
This example uses bash parameter substitution. It may or may not work in other shells. Check your SHELL environment variable if you are unsure which shell you are currently using.
-
Re-enable your DRBD resource:
# drbdadm up <resource>
-
On one node, promote the DRBD resource:
# drbdadm primary <resource>
-
Finally, grow the file system so it fills the extended size of the DRBD device.
5.16.3. Shrinking Online
Online shrinking is only supported with external metadata.
Before shrinking a DRBD device, you must shrink the layers above DRBD (usually the file system). Since DRBD cannot ask the file system how much space it actually uses, you have to be careful to not cause data loss.
Whether or not the filesystem can be shrunk online depends on the filesystem being used. Most filesystems do not support online shrinking. XFS does not support shrinking at all.
To shrink DRBD online, issue the following command after you have shrunk the file system residing on top of it:
# drbdadm resize --size=<new-size> <resource>
You may use the usual multiplier suffixes for <new-size> (K, M, G, and so on). After you have shrunk DRBD, you may also shrink the containing block device (if it supports shrinking).
It might be a good idea to issue drbdadm resize <resource> after resizing the lower level device, so that the DRBD metadata really gets written into the expected space at the end of the volume.
5.16.4. Shrinking Offline
If you were to shrink a backing block device while DRBD is inactive, DRBD would refuse to attach to this block device during the next attach attempt, because the block device would now be too small (if external metadata was in use), or it would be unable to find its metadata (if internal metadata was in use because DRBD metadata is written to the end of the backing block device). To work around these issues, use this procedure (if you cannot use online shrinking):
This is an advanced procedure. Use at your own discretion.
-
Shrink the file system from one node, while DRBD is still configured.
-
Unconfigure your DRBD resource:
# drbdadm down <resource>
-
Save the metadata in a text file prior to shrinking:
# drbdadm dump-md <resource> > /tmp/<resource>-metadata
If the dump-md command fails with a warning about “unclean” metadata, you will first need to run the command drbdadm apply-al <resource> to apply the activity log of the specified resource. You can then retry the dump-md command.
You must dump the metadata on all nodes that are configured for the DRBD resource, by using a separate dump file for each node.
Do not dump the metadata on one node and then simply copy the dump file to peer nodes. This. Will. Not. Work.
-
Shrink the backing block device on each node configured for the DRBD resource.
-
Adjust the size information (la-size-sect) in the file /tmp/<resource>-metadata accordingly, on each node. Remember that la-size-sect must be specified in sectors.
-
Only if you are using internal metadata (which at this time have probably been lost due to the shrinking process), re-initialize the metadata area:
# drbdadm create-md <resource>
-
Reimport the corrected metadata, on each node:
# drbdmeta_cmd=$(drbdadm --dry-run dump-md <resource>)
# ${drbdmeta_cmd/dump-md/restore-md} /tmp/<resource>-metadata
Valid meta-data in place, overwrite? [need to type 'yes' to confirm] yes
reinitializing
Successfully restored meta data
This example uses BASH parameter substitution to generate the drbdmeta restore-md command necessary to restore the modified metadata for the resource. It might not work in other shells. Check your SHELL environment variable if you are unsure which shell you are currently using.
-
Re-enable your DRBD resource:
# drbdadm up <resource>
5.17. Disabling Backing Device Flushes
You should only disable device flushes when running DRBD on devices with a battery-backed write cache (BBWC). Most storage controllers can automatically disable the write cache when the battery is depleted, switching to write-through mode when the battery dies. It is strongly recommended to enable such a feature.
Disabling DRBD’s flushes when running without BBWC, or on BBWC with a depleted battery, is likely to cause data loss and should not be attempted.
DRBD allows you to enable and disable backing device flushes separately for the replicated data set and DRBD’s own metadata. Both of these options are enabled by default. If you want to disable either (or both), you would set this in the disk section for the DRBD configuration file, /etc/drbd.conf.
To disable disk flushes for the replicated data set, include the following line in your configuration:
resource <resource> {
disk {
disk-flushes no;
...
}
...
}
To disable disk flushes on DRBD’s metadata, include the following line:
resource <resource> {
disk {
md-flushes no;
...
}
...
}
After you have modified your resource configuration (and synchronized your /etc/drbd.conf between nodes, of course), you may enable these settings by issuing this command on both nodes:
# drbdadm adjust <resource>
If only one of the servers has a BBWC[7], you should move the setting into a host section, like this:
resource <resource> {
disk {
... common settings ...
}
on host-1 {
disk {
md-flushes no;
}
...
}
...
}
5.18. Configuring Split Brain Behavior
5.18.1. Split Brain Notification
DRBD invokes the split-brain handler, if configured, whenever split brain is detected. To configure this handler, add the following item to your resource configuration:
resource <resource> {
    handlers {
        split-brain <handler>;
        ...
    }
    ...
}
<handler> may be any executable present on the system.
The DRBD distribution contains a split brain handler script that installs as /usr/lib/drbd/notify-split-brain.sh. It simply sends a notification e-mail message to a specified address. To configure the handler to send a message to root@localhost (which is expected to be an email address that forwards the notification to a real system administrator), configure the split-brain handler as follows:
resource <resource> {
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        ...
    }
    ...
}
After you have made this modification on a running resource (and synchronized the configuration file between nodes), no additional intervention is needed to enable the handler. DRBD will simply invoke the newly-configured handler on the next occurrence of split brain.
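Because the handler may be any executable, you can also write your own. The following is a minimal sketch of a hypothetical handler script that writes a line to syslog instead of sending e-mail. It is not shipped with DRBD. It assumes that DRBD exports the resource name to handlers in the DRBD_RESOURCE environment variable and that the logger utility is installed; verify both on your system before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical custom split-brain handler (illustrative, not part of DRBD).

# Build the log line. Kept as a separate function so that it can be
# checked without a running DRBD.
split_brain_message() {
    local resource=${1:-unknown} host=${2:-unknown}
    echo "DRBD split brain detected: resource=${resource} host=${host}"
}

# When invoked by DRBD, log the message to syslog.
# Assumption: DRBD_RESOURCE is set in the handler's environment.
if [ -n "${DRBD_RESOURCE:-}" ]; then
    logger -t drbd-split-brain \
        "$(split_brain_message "$DRBD_RESOURCE" "$(hostname)")"
fi
```

To use such a script, save it as an executable file on every node and reference its path in the split-brain entry of the handlers section, in place of notify-split-brain.sh.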
5.18.2. Automatic Split Brain Recovery Policies
Configuring DRBD to automatically resolve data divergence situations resulting from split-brain (or other) scenarios is configuring for potential automatic data loss. Understand the implications, and don’t do it if you don’t mean to.
Instead, consider fencing policies, quorum settings, cluster manager integration, and redundant cluster manager communication links to avoid data divergence in the first place.
DRBD offers several configuration options for automatic split brain recovery. It applies its split brain recovery procedures based on the number of nodes in the Primary role at the time the split brain is detected. To that end, DRBD examines the following keywords, all found in the resource’s net configuration section:
after-sb-0pri
Split brain has just been detected, but at this time the resource is not in the Primary role on any host. For this option, DRBD understands the following keywords:
-
disconnect: Do not recover automatically, simply invoke the split-brain handler script (if configured), drop the connection and continue in disconnected mode.
-
discard-younger-primary: Discard and roll back the modifications made on the host which assumed the Primary role last.
-
discard-least-changes: Discard and roll back the modifications on the host where fewer changes occurred.
-
discard-zero-changes: If there is any host on which no changes occurred at all, simply apply all modifications made on the other and continue.
after-sb-1pri
Split brain has just been detected, and at this time the resource is in the Primary role on one host. For this option, DRBD understands the following keywords:
-
disconnect: As with after-sb-0pri, simply invoke the split-brain handler script (if configured), drop the connection and continue in disconnected mode.
-
consensus: Apply the same recovery policies as specified in after-sb-0pri. If a split brain victim can be selected after applying these policies, automatically resolve. Otherwise, behave exactly as if disconnect were specified.
-
call-pri-lost-after-sb: Apply the recovery policies as specified in after-sb-0pri. If a split brain victim can be selected after applying these policies, invoke the pri-lost-after-sb handler on the victim node. This handler must be configured in the handlers section and is expected to forcibly remove the node from the cluster.
-
discard-secondary: Whichever host is currently in the Secondary role, make that host the split brain victim.
after-sb-2pri
Split brain has just been detected, and at this time the resource is in the Primary role on both hosts. This option accepts the same keywords as after-sb-1pri except discard-secondary and consensus.
DRBD understands additional keywords for these three options, which have been omitted here because they are very rarely used. Refer to the man page of drbd.conf for details on split brain recovery keywords not discussed here.
For example, a resource which serves as the block device for a GFS or OCFS2 file system in dual-Primary mode may have its recovery policy defined as follows:
resource <resource> {
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        ...
    }
    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        ...
    }
    ...
}
5.19. Creating a Stacked Three-node Setup
A three-node setup involves one DRBD device stacked atop another.
Stacking is deprecated in DRBD version 9.x, as more nodes can be implemented on a single level. See Defining network connections for details.
5.19.1. Device Stacking Considerations
The following considerations apply to this type of setup:
-
The stacked device is the active one. Assume you have configured one DRBD device /dev/drbd0, and the stacked device atop it is /dev/drbd10, then /dev/drbd10 will be the device that you mount and use.
-
Device metadata will be stored twice, on the underlying DRBD device and the stacked DRBD device. On the stacked device, you must always use internal metadata. This means that the effectively available storage area on a stacked device is slightly smaller, compared to an unstacked device.
-
To get the stacked upper level device running, the underlying device must be in the primary role.
-
To be able to synchronize the backup node, the stacked device on the active node must be up and in the primary role.
5.19.2. Configuring a Stacked Resource
In the following example, nodes are named ‘alice’, ‘bob’, and ‘charlie’, with ‘alice’ and ‘bob’ forming a two-node cluster, and ‘charlie’ being the backup node.
resource r0 {
  protocol C;
  device /dev/drbd0;
  disk /dev/sda6;
  meta-disk internal;
  on alice {
    address 10.0.0.1:7788;
  }
  on bob {
    address 10.0.0.2:7788;
  }
}
resource r0-U {
  protocol A;
  stacked-on-top-of r0 {
    device /dev/drbd10;
    address 192.168.42.1:7789;
  }
  on charlie {
    device /dev/drbd10;
    disk /dev/hda6;
    address 192.168.42.2:7789; # Public IP of the backup node
    meta-disk internal;
  }
}
As with any drbd.conf configuration file, this must be distributed across all nodes in the cluster (in this case, three nodes). Notice the following extra keyword not found in an unstacked resource configuration:
stacked-on-top-of
This option informs DRBD that the resource which contains it is a stacked resource. It replaces one of the on sections normally found in any resource configuration. Do not use stacked-on-top-of in a lower-level resource.
It is not a requirement to use Protocol A for stacked resources. You may select any of DRBD’s replication protocols depending on your application. |
5.19.3. Enabling Stacked Resources
To enable a stacked resource, you first enable its lower-level resource and promote it:
# drbdadm up r0
# drbdadm primary r0
As with unstacked resources, you must create DRBD metadata on the stacked resources. This is done using the following command:
# drbdadm create-md --stacked r0-U
Then, you may enable the stacked resource:
# drbdadm up --stacked r0-U
# drbdadm primary --stacked r0-U
After this, you may bring up the resource on the backup node, enabling three-node replication:
# drbdadm create-md r0-U
# drbdadm up r0-U
To automate stacked resource management, you may integrate stacked resources in your cluster manager configuration.
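The order of the steps above matters: the lower-level resource has to be up and primary before the stacked resource can be enabled. As a minimal sketch, the following shell helpers only print the commands from this section in that order, so that you can review them before running anything. The function names stacked_bringup_plan and backup_bringup_plan are hypothetical and are not part of the DRBD utilities.

```shell
# Hypothetical helpers (not part of drbd-utils): print, without executing,
# the commands used in this section.

# Commands for the active node.
# $1 = lower-level resource (for example r0), $2 = stacked resource (r0-U).
stacked_bringup_plan() {
    lower=$1
    upper=$2
    echo "drbdadm up $lower"
    echo "drbdadm primary $lower"
    # create-md is only needed the first time the stacked resource is set up
    echo "drbdadm create-md --stacked $upper"
    echo "drbdadm up --stacked $upper"
    echo "drbdadm primary --stacked $upper"
}

# Commands for the backup node, which uses the resource without --stacked.
# $1 = stacked resource name (for example r0-U).
backup_bringup_plan() {
    upper=$1
    echo "drbdadm create-md $upper"
    echo "drbdadm up $upper"
}
```

For the example configuration, stacked_bringup_plan r0 r0-U lists the five commands for the active node, and backup_bringup_plan r0-U lists the two commands for the backup node.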
5.20. Permanently Diskless Nodes
A node might be permanently diskless in DRBD. Here is a configuration example showing a resource with 3 diskful nodes (servers) and one permanently diskless node (client).
resource kvm-mail {
  device /dev/drbd6;
  disk /dev/vg/kvm-mail;
  meta-disk internal;
  on store1 {
    address 10.1.10.1:7006;
    node-id 0;
  }
  on store2 {
    address 10.1.10.2:7006;
    node-id 1;
  }
  on store3 {
    address 10.1.10.3:7006;
    node-id 2;
  }
  on for-later-rebalancing {
    address 10.1.10.4:7006;
    node-id 3;
  }
  # DRBD "client"
  floating 10.1.11.6:8006 {
    disk none;
    node-id 4;
  }
  # rest omitted for brevity
  ...
}
For permanently diskless nodes no bitmap slot gets allocated. For such nodes the diskless status is displayed in green color since it is not an error or unexpected state. See The Client Mode for internal details.
5.21. Data Rebalancing
Given the (example) policy that data needs to be available on 3 nodes, you need at least 3 servers for your setup.
Now, as your storage demands grow, you will encounter the need for additional servers. Rather than having to buy 3 more servers at the same time, you can rebalance your data across a single additional node.
In the figure above you can see the before and after states: from 3 nodes with three 25TiB volumes each (for a net 75TiB), to 4 nodes, with net 100TiB.
To redistribute the data across your cluster, you have to choose a new node to receive the DRBD resource, and an existing node from which you want to remove it.
Please note that removing the resource from a currently active node (that is, where DRBD is Primary) will involve either migrating the service or running this resource on this node as a DRBD client; it is easier to choose a node in the Secondary role. (Of course, that might not always be possible.)
5.21.1. Prepare a Bitmap Slot
You will need to have a free bitmap slot for temporary use, on each of the nodes that have the resource that is to be moved.
You can allocate one more at drbdadm create-md time, or simply put a placeholder in your configuration, so that drbdadm sees that it should reserve one more slot:
resource r0 {
  ...
  on for-later-rebalancing {
    address 10.254.254.254:65533;
    node-id 3;
  }
}
If you need to make that slot available during live use, you will have to
In a future version |
5.21.2. Preparing and Activating the New Node
First of all, you have to create the underlying storage volume on the new node (for example, by using lvcreate). Then the placeholder in the configuration can be filled with the correct host name, address, and storage path. Now copy the resource configuration to all relevant nodes.
On the new node initialize the meta-data (once) by doing
# drbdadm create-md <resource>
v09 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New drbd meta data block successfully created.
5.21.3. Starting the Initial Synchronization
Now the new node needs to get the data.
This is done by defining the network connection on the existing nodes using the command:
# drbdadm adjust <resource>
and starting the DRBD device on the new node using the command:
# drbdadm up <resource>
5.21.4. Check Connectivity
At this time, show the status of your DRBD resource by entering the following command on the new node:
# drbdadm status <resource>
Verify that all other nodes are connected.
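As a rough sketch of how this check could be scripted, the helpers below read drbdadm status output on standard input and list peers that are not fully connected. This assumes that such peers carry a connection:<state> field (for example connection:Connecting) on their line, while fully connected peers do not; verify this against the output of your drbd-utils version. The function names are hypothetical.

```shell
# Hypothetical helpers: inspect `drbdadm status <resource>` output on stdin.
# Assumption: a peer line contains a "connection:<state>" field only while
# the peer is not fully connected.

# Print "<peer> connection:<state>" for every peer that is not connected.
unconnected_peers() {
    awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^connection:/) print $1 " " $i }'
}

# Succeed only if no peer reports a connection:<state> field.
all_peers_connected() {
    [ -z "$(unconnected_peers)" ]
}
```

On the new node, you could then run drbdadm status <resource> | all_peers_connected and act on the exit status.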
5.21.5. After the Initial Synchronization
As soon as the new host is UpToDate, one of the other nodes in the configuration can be renamed to for-later-rebalancing, and kept for another migration.
Perhaps you want to comment out the section, although that carries the risk that a drbdadm create-md for a new node allocates too few bitmap slots for the next rebalancing. It might be easier to use a reserved (unused) IP address and host name. |
Copy the changed configuration around again, and use it by running
# drbdadm adjust <resource>
on all nodes.
5.21.6. Cleaning Up
On the one node that had the data up to now, but isn’t used anymore for this resource, you can now take the DRBD device down by entering:
# drbdsetup down <resource>
Use a drbdsetup command rather than a drbdadm command to down the resource
because you cannot use drbdadm to down a resource that is no longer in the configuration file.
|
Now the lower-level storage device isn’t used anymore, and can either be re-used for other purposes or, if it is a logical volume, its space can be returned to the volume group using the lvremove command.
5.21.7. Conclusion and Further Steps
One of the resources has been migrated to the new node. The same could be done for one or more other resources, to make free space on two or three nodes in the existing cluster.
Then new resources can be configured, as there are enough nodes with free space to achieve 3-way redundancy again.
5.22. Configuring Quorum
This section describes how you can configure the DRBD quorum feature to avoid split-brain situations and data divergence in your high-availability clusters.
You enable quorum for a DRBD resource by setting the quorum option within the options section of a DRBD resource configuration file to either majority, all, or a numerical value.
By default, DRBD quorum is not enabled and the option is set to off.
An example DRBD resource configuration that enables DRBD quorum is as follows:
resource quorum-demo {
  options {
    quorum <majority|all|numerical value>;
    [...]
  }
  [...]
}
You can also enable quorum globally, by setting the quorum option within the options subsection of the common section within the global DRBD configuration file, /etc/drbd.d/global_common.conf. Enabling quorum and any other related options within the global configuration file will affect all DRBD resources, unless the same option is set within a specific DRBD resource configuration file. In that case, the option value set within a DRBD resource configuration file will take precedence.
An example global configuration that enables DRBD quorum is as follows:
[...]
common {
  options {
    quorum <majority|all|numerical value>;
    [...]
  }
  [...]
}
[...]
5.22.1. Setting DRBD Quorum to Majority
By setting the quorum option to majority, a node can only write to the replicated data set if the node is a member of a majority partition of the DRBD-running nodes in the cluster. The number of nodes in a majority partition must be greater than half the number of total nodes in the cluster for a given DRBD resource. In a 3-node cluster, such a node would need to be able to communicate over the network with at least one other node.
An exception to this is if other secondary nodes have exited the cluster gracefully, and have marked their data as outdated. Outdated nodes do not take part in quorum voting. This behavior can lead to the Last Man Standing situation. This allows services to keep running in your cluster, even if only one up-to-date node remains.
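The "greater than half" rule can be worked through with a little integer arithmetic. The following sketch is only an illustration of the rule as stated above, not DRBD's implementation, and it ignores the exception for outdated nodes. The function names are hypothetical.

```shell
# Illustration only: the majority rule as described in this section.

# Smallest partition size that is greater than half of $1 total nodes.
majority_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed if a partition of $2 nodes (counting the local node) is a
# majority among $1 total nodes.
has_majority() {
    [ "$2" -ge "$(majority_threshold "$1")" ]
}
```

For a 3-node cluster the threshold is 2 (the local node plus one peer), and for a 5-node cluster it is 3. In a 2-node cluster the threshold is also 2, which is why losing the only peer means losing quorum.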
5.22.2. Setting DRBD Quorum to All
By setting the quorum option to all, a node can only write to the replicated data set if the node can communicate over the network with all other DRBD-running nodes for that data set in the cluster. This is the strictest possible quorum implementation and the most cautious (or most paranoid) way to avoid data divergence in your cluster.
The exception to the rule mentioned in the majority section for gracefully exiting nodes also applies to the all quorum implementation. That is, gracefully exiting nodes will mark their data as outdated and outdated nodes will not take part in quorum voting.
5.22.3. Setting DRBD Quorum to a Numerical Value
For flexibility, you can also set the quorum option to a numerical value. This quorum implementation is the most solid, because you, as an administrator, will have knowledge about the total number of nodes in your cluster that might be outside the scope of the majority and all quorum heuristics for some corner cases. However, setting quorum to a numerical value requires manual intervention and a property change if you increase or decrease the number of nodes in your cluster. For almost all cases, setting the quorum option to majority is sufficient and preferred.
If you choose to set the quorum property to a numerical value, choose a numerical value that is greater than half of the number of total nodes in the cluster. You undermine the purpose of quorum and risk data divergence if you choose a numerical value less than this.
|
5.22.4. Guaranteed Minimal Redundancy
By using the quorum-minimum-redundancy option in a DRBD resource or global configuration, you define that a quorum partition must consist of at least the number of nodes in an up-to-date state that you specify with the option. The quorum-minimum-redundancy option takes the same arguments as the quorum option: majority, all, or a numerical value.
Because only diskful nodes can be in an up-to-date state, using this option is a way to express that you prefer to wait until data resynchronization operations finish before a service that relies on the replicated data can start. That is, you prefer the security of guaranteeing a minimum redundancy of your data over the availability of your service. Financial data redundancy, or data redundancy required for regulatory reasons are examples where setting this option could be useful.
There might be some corner cases where specifying the quorum-minimum-redundancy option for a DRBD resource could lead to a situation that would require manual intervention to meet the minimum redundancy requirement. Also, a potential side effect of specifying a minimum data replica redundancy for quorum is that it will prevent Last Man Standing behavior in your cluster for replica redundancy values greater than 1. For these reasons, using this option should not be needed in most cases, unless requirements for your data compel you to use it.
|
Consider the following example configuration for a 5-node cluster:
resource quorum-demo {
  options {
    quorum majority;
    quorum-minimum-redundancy 2;
    [...]
  }
  [...]
}
In this example, for a 5-node cluster, a majority partition consists of three nodes. Because of the quorum-minimum-redundancy option, two of the three nodes must be diskful and in an up-to-date state before the primary node for the resource is allowed to write to the data set.
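The two conditions in this example can be combined into one check. The sketch below is a simplified model of the rule as described here, assuming quorum majority and a numerical quorum-minimum-redundancy value; it is not how DRBD itself evaluates quorum, and the function name may_write is hypothetical.

```shell
# Simplified model of "quorum majority" plus a numerical
# "quorum-minimum-redundancy", as described in this section.
# $1 = total nodes configured for the resource
# $2 = nodes in the local partition (counting the local node)
# $3 = diskful, up-to-date nodes in the local partition
# $4 = quorum-minimum-redundancy value
may_write() {
    total=$1
    partition=$2
    uptodate=$3
    min_redundancy=$4
    # The partition must be a majority and hold enough up-to-date replicas.
    [ "$partition" -ge $(( total / 2 + 1 )) ] &&
        [ "$uptodate" -ge "$min_redundancy" ]
}
```

With the values from the example, may_write 5 3 2 2 succeeds, while may_write 5 3 1 2 fails because only one up-to-date replica is reachable, and may_write 5 2 2 2 fails because two nodes are not a majority of five.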
5.22.5. Actions on Loss of Quorum
When a node that is in a primary role for a DRBD resource loses quorum, the node needs to stop write operations on the data set immediately. That means that I/O immediately starts to return errors for all I/O requests to the DRBD device. Usually that means that a graceful shutdown is not possible. A graceful shutdown would require more modifications to the data set, such as marking the data set out-of-date on the node.
Next, the I/O errors propagate from the block level to the file system and from the file system to the user space application(s). Ideally the application simply terminates in case of I/O errors. This then allows a cluster resource manager, such as Pacemaker or DRBD Reactor, to unmount the file system and to demote the node to secondary role for the DRBD resource. If that is true of the behavior of your application, you should set the on-no-quorum resource option to io-error. Here is an example configuration:
resource quorum-demo {
  options {
    quorum majority;
    on-no-quorum io-error;
    [...]
  }
}
If your application does not terminate on the first I/O error, you can configure DRBD to freeze I/O for a resource on a DRBD primary node that loses quorum. Here is a configuration example:
resource quorum-demo {
  options {
    quorum majority;
    on-no-quorum suspend-io;
    [...]
  }
  [...]
}
With this configuration, if a primary node does lose quorum and suspends I/O to its data set, you can follow the steps described in the Recovering a Primary Node that Lost Quorum section.
5.22.6. Using a Diskless Node as a Tiebreaker
A diskless node with connections to all nodes in a cluster can be used to break ties in the quorum negotiation process.
Consider the following two-node cluster, where node A is the primary and node B is a secondary:
As soon as the connection between the two nodes is interrupted, they lose quorum and the application on top of the cluster cannot write data anymore.
If you add a third node, C, to the cluster and configure it as diskless, you can take advantage of the Quorum Tiebreaker feature, available in DRBD versions since 9.0.18.
In this case, when the primary and secondary nodes lose connection to each other, each can still communicate with the diskless tiebreaker. Because of this, the primary node can continue working, while the secondary node demotes its DRBD resource to an out-of-date state. While the resource is in an out-of-date state, the node cannot be promoted to a primary role.
There are a few special cases if two connections fail. Consider the following scenario:
In this case, the tiebreaker node forms a partition with the primary node. The primary therefore keeps quorum, while the secondary node becomes outdated.
Here, the secondary node’s DRBD resource state will be “UpToDate”, but regardless it cannot be promoted to a primary role because it lacks quorum. |
Next, consider the possibility of the primary node losing connection to the tiebreaker node:
In this case, the primary node becomes unusable and goes into a “quorum suspended” state. This effectively results in the application on top of DRBD receiving I/O errors. A cluster resource manager such as Pacemaker or DRBD Reactor could then promote node B to a primary role and keep the service running on that node.
You also need to avoid data divergence if the diskless tiebreaker node “switches sides”. Consider this scenario:
The connection between the primary and secondary nodes has failed but the application continues to run on the primary node because it is part of a majority partition with the diskless tiebreaker node. Then the primary node suddenly loses its connection to the diskless node.
In this case, no node can be promoted to a primary role and the cluster cannot continue to operate.
Protecting against data divergence always takes priority over ensuring service availability. |
Next, consider another scenario:
Here, the application is running on the primary node, while the secondary node is unavailable. Then the tiebreaker node first loses connection to the primary node, and then reconnects to the secondary node. In this case, the secondary node cannot become the primary node.
A node that has lost quorum cannot regain quorum by connecting to a diskless node. In this case, there is no reliable way for the reconnecting node to check the “up-to-date”-ness of its data set against a diskless node, because the diskless node does not have a local data set to be able to make an accurate check. Therefore, in this case, no node has quorum and the cluster halts. |
5.22.7. Last Man Standing
Nodes that leave a cluster gracefully are counted differently from failed nodes for determining DRBD quorum. In this context, leaving gracefully means that a leaving node marked its data as out-of-date, and that the node was able to tell the remaining nodes that its data is out-of-date. Leaving gracefully would be equivalent to running a drbdadm down (or drbdadm disconnect) command.
In a group of nodes where all disks are out-of-date, no node in that group can be promoted to a primary role.[8]
An implication is that, if one node remains in a cluster where all the other nodes left gracefully, the remaining node can keep quorum. However, if any of the other nodes left ungracefully, the remaining node must assume that the departed nodes could form a partition and have access to up-to-date data.
The last man standing behavior is useful when you need to perform system maintenance such as upgrading software or replacing or adding hardware on nodes. In this case, a node or nodes that you might need to maintain can leave the cluster gracefully and allow a primary node to continue to run, host services, and write to the replicated data set. This gives you an environment where you can service other nodes without having application or service downtime for your users.
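The difference between graceful and ungraceful departures can be illustrated with a small model. It assumes, as a simplification of the description above, that nodes which left gracefully (and are therefore known to be outdated) are simply removed from the set of voters, and that the remaining node needs a majority of the voters that are left. This is an illustration only, not DRBD's actual algorithm, and the function name is hypothetical.

```shell
# Simplified model of last man standing (illustration only).
# $1 = total nodes configured for the resource
# $2 = nodes that left gracefully and are known to be outdated
# $3 = nodes still reachable, counting the local node
keeps_quorum() {
    total=$1
    outdated=$2
    present=$3
    # Outdated nodes do not take part in quorum voting.
    voters=$(( total - outdated ))
    [ "$present" -ge $(( voters / 2 + 1 )) ]
}
```

In a 3-node cluster, keeps_quorum 3 2 1 succeeds: both peers left gracefully, so the last node keeps quorum. By contrast, keeps_quorum 3 0 1 fails, because two peers that vanished ungracefully might still form a partition with up-to-date data.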
5.23. Removing DRBD
For the unlikely case that you want to remove DRBD, here are the necessary steps.
-
Stop the services and unmount the file systems on top of the DRBD volumes. In case you are using a cluster manager, verify that it ceases to control the services first.
-
Stop the DRBD resource(s) by using drbdadm down <res> or drbdadm down all.
-
In case the DRBD resource was using internal meta-data, you might choose to resize the file system to cover all of the backing device’s space. This step effectively removes DRBD’s meta-data. This is an action that cannot be reversed easily. For the ext[234] family of file systems, you can do that with resize2fs <backing_dev>, which supports resizing unmounted file systems and, under certain conditions, growing them online. XFS can only be grown online, by using the xfs_growfs command.
-
Mount the backing device(s) directly, and start the services on top of them.
-
Unload the DRBD kernel driver modules with rmmod drbd_transport_tcp and rmmod drbd.
-
Uninstall the DRBD software packages.
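The module unloading step depends on order: the transport module has to be removed before the drbd core module that it uses. As a minimal sketch, the following helper reads a module list in the format of /proc/modules and prints the rmmod commands in that order without executing them. The function name drbd_rmmod_plan is hypothetical, and matching every module named drbd_transport_* (not only drbd_transport_tcp) is an assumption.

```shell
# Hypothetical helper: print rmmod commands for loaded DRBD modules,
# transport modules first, then the drbd core module.
# $1 = module list to read (defaults to /proc/modules).
drbd_rmmod_plan() {
    modules_file=${1:-/proc/modules}
    # The first field of each line in /proc/modules is the module name.
    awk '$1 ~ /^drbd_transport_/ { print "rmmod " $1 }' "$modules_file"
    awk '$1 == "drbd" { print "rmmod " $1 }' "$modules_file"
}
```

Running drbd_rmmod_plan without arguments on a node shows which modules are still loaded and in which order to unload them. No output means that no DRBD modules are loaded.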
6. Using DRBD Proxy
6.1. DRBD Proxy Deployment Considerations
The DRBD Proxy processes can either be located directly on the machines where DRBD is set up, or they can be placed on distinct dedicated servers. A DRBD Proxy instance can serve as a proxy for multiple DRBD devices distributed across multiple nodes.
DRBD Proxy is completely transparent to DRBD. Typically, you can expect a high number of data packets in flight; therefore, the activity log should be reasonably large. Because this may cause longer resync runs after the failure of a primary node, it is recommended to enable the DRBD csums-alg setting.
For more information about the rationale for the DRBD Proxy, please see the feature explanation Long-distance Replication through DRBD Proxy.
The DRBD Proxy 3 uses several kernel features that are only available since 2.6.26, so running it on older systems (for example, RHEL 5) is not possible. Here we can still provide DRBD Proxy 1 packages, though[9].
6.2. Installing DRBD Proxy
To obtain DRBD Proxy, please contact your LINBIT sales representative. Unless instructed otherwise, please always use the most recent DRBD Proxy release.
To install DRBD Proxy on Debian and Debian-based systems, use the dpkg tool as follows (replace version with your DRBD Proxy version, and architecture with your target architecture):
# dpkg -i drbd-proxy_3.2.2_amd64.deb
To install DRBD Proxy on RPM based systems (like SLES or RHEL) use the RPM tool as follows (replace version with your DRBD Proxy version, and architecture with your target architecture):
# rpm -i drbd-proxy-3.2.2-1.x86_64.rpm
Also install the DRBD administration program drbdadm since it is required to configure DRBD Proxy.
This will install the DRBD Proxy binaries as well as an init script which usually goes into /etc/init.d. Please always use the init script to start/stop DRBD proxy since it also configures DRBD Proxy using the drbdadm tool.
6.3. License File
When obtaining a license from LINBIT, you will be sent a DRBD Proxy license file which is required to run DRBD Proxy. The file is called drbd-proxy.license. It must be copied into the /etc directory of the target machines, and be owned by the user/group drbdpxy.
# cp drbd-proxy.license /etc/
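Because the copy step alone does not set the ownership, it can be useful to verify it afterwards. The sketch below is a hypothetical check and not part of DRBD Proxy. It assumes that both the user and the group are literally named drbdpxy, and it reads the owner from ls -ld output, which is a simplification that works for plain user and group names.

```shell
# Hypothetical check (not part of DRBD Proxy).

# Print "<user>:<group>" for the given file, or nothing if it is missing.
license_owner() {
    ls -ld "$1" 2>/dev/null | awk '{ print $3 ":" $4 }'
}

# Succeed if the license file exists and is owned by drbdpxy:drbdpxy.
# $1 = license file (defaults to /etc/drbd-proxy.license).
license_ok() {
    file=${1:-/etc/drbd-proxy.license}
    [ -f "$file" ] && [ "$(license_owner "$file")" = "drbdpxy:drbdpxy" ]
}
```

If license_ok fails on a target machine, check that the file was copied and adjust its ownership with chown.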
6.4. Configuring DRBD Proxy Using LINSTOR
DRBD Proxy can be configured using LINSTOR as described in the LINSTOR User’s Guide.
6.5. Configuring DRBD Proxy Using Resource Files
DRBD Proxy can also be configured by editing resource files. It is configured by an additional section called proxy and additional proxy on sections within the host sections.
Below is a DRBD configuration example for proxies running directly on the DRBD nodes:
resource r0 {
  protocol A;
  device /dev/drbd15;
  disk /dev/VG/r0;
  meta-disk internal;
  proxy {
    memlimit 512M;
    plugin {
      zlib level 9;
    }
  }
  on alice {
    address 127.0.0.1:7915;
    proxy on alice {
      inside 127.0.0.1:7815;
      outside 192.168.23.1:7715;
    }
  }
  on bob {
    address 127.0.0.1:7915;
    proxy on bob {
      inside 127.0.0.1:7815;
      outside 192.168.23.2:7715;
    }
  }
}
The inside IP address is used for communication between DRBD and the DRBD Proxy, whereas the outside IP address is used for communication between the proxies. The latter channel might have to be allowed in your firewall setup.
6.6. Controlling DRBD Proxy
drbdadm offers the proxy-up and proxy-down subcommands to configure or delete the connection to the local DRBD Proxy process of the named DRBD resource(s). These commands are used by the start and stop actions which /etc/init.d/drbdproxy implements.
DRBD Proxy has a low-level configuration tool called drbd-proxy-ctl. When called without any option, it operates in interactive mode.
To pass a command directly, avoiding interactive mode, use the -c parameter followed by the command.
To display the available commands use:
# drbd-proxy-ctl -c "help"
Note the double quotes around the command being passed.
Here is a list of commands; while the first few are typically only used indirectly (through drbdadm proxy-up or drbdadm proxy-down), the latter ones give various status information.
add connection <name> lots of arguments
-
Creates a communication path. As this is run via drbdadm proxy-up, the long list of arguments is omitted here.
del connection <name>
-
Removes a communication path.
set memlimit <name> <memlimit-in-bytes>
-
Sets the memory limit for a connection; this can only be done when setting it up afresh, changing it during runtime is not possible. This command understands the usual units k, M, and G.
show
-
Shows currently configured communication paths.
show memusage
-
Shows memory usage of each connection. For example, the following command monitors memory usage:
# watch -n 1 'drbd-proxy-ctl -c "show memusage"'
The quotes around show memusage are required.
show [h]subconnections
-
Shows currently established individual connections together with some stats. With h, outputs bytes in human-readable format.
show [h]connections
-
Shows currently configured connections and their states. With h, outputs bytes in human-readable format. The Status column will show one of these states:
-
Off: No communication to the remote DRBD Proxy process.
-
Half-up: The connection to the remote DRBD Proxy could be established; the Proxy ⇒ DRBD paths are not up yet.
-
DRBD-conn: The first few packets are being pushed across the connection; but still, for example, a split-brain situation might sever it again.
-
Up: The DRBD connection is fully established.
shutdown
-
Shuts down the drbd-proxy program. This unconditionally terminates any DRBD connections that are using the DRBD proxy.
quit
-
Exits the client program (closes the control connection), but leaves the DRBD proxy running.
print statistics
-
This prints detailed statistics for the currently active connections, in a format that can be easily parsed. Use this for integration to your monitoring solution!
While the commands above are only accepted from UID 0 (that is, the root user), this one can be used by any user (provided that UNIX permissions allow access on the proxy socket at /var/run/drbd-proxy/drbd-proxy-ctl.socket). Refer to the init script at /etc/init.d/drbdproxy about setting the permissions.
6.7. About DRBD Proxy Plugins
Since DRBD Proxy version 3, the proxy allows you to enable a few specific plugins for the WAN connection. The currently available plugins are zstd, lz4, zlib, and lzma (all software compression).
zstd (Zstandard) is a real-time compression algorithm, providing high compression ratios. It offers a very wide range of compression and speed trade-offs, while being backed by a very fast decoder. Compression rates depend on the “level” parameter, which can be set from 1 to 22. Above level 20, DRBD Proxy will require more memory.
lz4 is a very fast compression algorithm; the data typically gets compressed down by 1:2 to 1:4, so half to two-thirds of the bandwidth can be saved.
The zlib plugin uses the GZIP algorithm for compression; it uses a bit more CPU than lz4, but gives a ratio of 1:3 to 1:5.
The lzma plugin uses the liblzma2 library. It can use dictionaries of several hundred MiB; these allow for very efficient delta-compression of repeated data, even for small changes. lzma needs much more CPU and memory, but results in much better compression than zlib: real-world tests with a VM sitting on top of DRBD gave ratios of 1:10 to 1:40. The lzma plugin has to be enabled in your license.
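To relate the quoted ratios to WAN bandwidth, a ratio of 1:N means that roughly 1/N of the original bytes cross the link, so the share saved is (N - 1)/N. The following sketch does this arithmetic with integer percentages, rounded down. It is a back-of-the-envelope calculation under that assumption, not a measurement, and the function name is hypothetical.

```shell
# Back-of-the-envelope estimate: percentage of WAN bandwidth saved for a
# compression ratio of 1:N, rounded down to a whole percent.
# $1 = N (for example 10 for a ratio of 1:10).
saved_percent() {
    echo $(( ($1 - 1) * 100 / $1 ))
}
```

For example, saved_percent 2 gives 50, and saved_percent 10 gives 90. Actual savings depend on how compressible your data is.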
Contact LINBIT to find the best settings for your environment. They depend on the CPU (speed, number of threads), available memory, input and available output bandwidth, and expected I/O spikes. Having a week of sysstat data already available helps in determining the configuration, too.
The older compression on setting in the proxy section is deprecated, and will be removed in a future release. Currently it is treated as zlib level 9.
|
6.7.1. Using a WAN-side Bandwidth Limit
The experimental bwlimit option of DRBD Proxy is broken. Do not use it, as it may cause applications on DRBD to block on I/O. It will be removed.
Instead use the Linux kernel’s traffic control framework to limit bandwidth consumed by proxy on the WAN side.
In the following example you would need to replace the interface name, the source port and the IP address of the peer.
# tc qdisc add dev eth0 root handle 1: htb default 1
# tc class add dev eth0 parent 1: classid 1:1 htb rate 1gbit
# tc class add dev eth0 parent 1:1 classid 1:10 htb rate 500kbit
# tc filter add dev eth0 parent 1: protocol ip prio 16 u32 \
    match ip sport 7000 0xffff \
    match ip dst 192.168.47.11 flowid 1:10
# tc filter add dev eth0 parent 1: protocol ip prio 16 u32 \
    match ip dport 7000 0xffff \
    match ip dst 192.168.47.11 flowid 1:10
You can remove this bandwidth limitation with:
# tc qdisc del dev eth0 root handle 1
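Because the interface name, port, and peer address all have to be replaced by hand in the example above, it can help to generate the commands from parameters. The following sketch only prints the same tc commands with the values substituted; it does not execute them. The function name proxy_bwlimit_plan is hypothetical, and the 1gbit rate of the parent class is carried over unchanged from the example.

```shell
# Hypothetical helper: print the tc commands from the example above with
# site-specific values filled in. Nothing is executed.
# $1 = interface, $2 = proxy port, $3 = peer IP address,
# $4 = rate limit for proxy traffic (for example 500kbit).
proxy_bwlimit_plan() {
    dev=$1
    port=$2
    peer=$3
    rate=$4
    cat <<EOF
tc qdisc add dev $dev root handle 1: htb default 1
tc class add dev $dev parent 1: classid 1:1 htb rate 1gbit
tc class add dev $dev parent 1:1 classid 1:10 htb rate $rate
tc filter add dev $dev parent 1: protocol ip prio 16 u32 match ip sport $port 0xffff match ip dst $peer flowid 1:10
tc filter add dev $dev parent 1: protocol ip prio 16 u32 match ip dport $port 0xffff match ip dst $peer flowid 1:10
EOF
}
```

Calling proxy_bwlimit_plan eth0 7000 192.168.47.11 500kbit reproduces the commands shown above, which you can review before running them as root.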
6.8. Troubleshooting
DRBD Proxy logs events through syslog using the LOG_DAEMON facility. Usually you will find DRBD Proxy events in /var/log/daemon.log.
Enabling debug mode in DRBD Proxy can be done with the following command.
# drbd-proxy-ctl -c 'set loglevel debug'
For example, if proxy fails to connect it will log something like Rejecting connection because I can’t connect on the other side. In that case, please check if DRBD is running (not in StandAlone mode) on both nodes and if both proxies are running. Also double-check your configuration.
7. Troubleshooting and Error Recovery
This chapter describes tasks to be performed in case of hardware or system failures.
7.1. Getting Information About DRBD Error Codes
DRBD and the DRBD administrative tool, drbdadm, return POSIX error codes. If you need to get more information about a specific error code number, you can use the following command, provided that Perl is installed in your environment. For example, to get information about error code number 11, enter:
# perl -e 'print $! = 11, "\n"'
Resource temporarily unavailable
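If you look up error codes often, the one-liner above can be wrapped in a small shell function that takes the code as an argument. This is only a convenience sketch around the same Perl expression; the function name errno_text is hypothetical, and the message text comes from the C library of the system where you run it.

```shell
# Convenience wrapper around the Perl one-liner shown above.
# $1 = numeric error code (for example 11).
errno_text() {
    case $1 in
        # Reject empty or non-numeric input before calling Perl.
        ''|*[!0-9]*) echo "usage: errno_text <number>" >&2; return 2 ;;
    esac
    perl -e 'print $! = $ARGV[0], "\n"' "$1"
}
```

For example, errno_text 11 prints the message for error code 11.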
7.2. Dealing with Hard Disk Failure
How to deal with hard disk failure depends on the way DRBD is configured to handle disk I/O errors (see Disk Error Handling Strategies), and on the type of metadata configured (see DRBD Metadata).
For the most part, the steps described here apply only if you run DRBD directly on top of physical hard disks. They generally do not apply in case you are running DRBD layered on top of
|
7.2.1. Manually Detaching DRBD from Your Hard Disk
If DRBD is configured to pass on I/O errors (not recommended), you must first detach the DRBD resource, that is, disassociate it from its backing storage:
# drbdadm detach <resource>
By running the drbdadm status or the drbdadm dstate command, you will now be able to verify that the resource is now in diskless mode:
# drbdadm status <resource>
<resource> role:Primary
  volume:0 disk:Diskless
  <peer> role:Secondary
    volume:0 peer-disk:UpToDate
# drbdadm dstate <resource>
Diskless/UpToDate
If the disk failure has occurred on your primary node, you may combine this step with a switch-over operation.
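In a script, the verification can be reduced to checking the local half of the dstate output. The sketch below assumes the single-line local/peer format shown above (for example Diskless/UpToDate); with several peers or volumes the output may differ, so treat it as an illustration. The function names are hypothetical.

```shell
# Hypothetical helpers for the output of `drbdadm dstate <resource>`.
# Assumption: the output has the form "<local-state>/<peer-state>".

# Print the local disk state, that is, everything before the first "/".
local_disk_state() {
    echo "${1%%/*}"
}

# Succeed if the local disk state is Diskless.
is_detached() {
    [ "$(local_disk_state "$1")" = "Diskless" ]
}
```

A possible use is is_detached "$(drbdadm dstate <resource>)", which succeeds once the resource has been detached from its backing storage.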
7.2.2. Automatically Detaching on I/O Error
If DRBD is configured to automatically detach upon I/O error (the recommended option), DRBD should have automatically detached the resource from its backing storage already, without manual intervention. You may still use the drbdadm status command to verify that the resource is in fact running in diskless mode.
7.2.3. Replacing a Failed Disk When Using Internal Metadata
If using internal metadata, it is sufficient to bind the DRBD device to the new hard disk. If the new hard disk has to be addressed by another Linux device name than the defective disk, the DRBD configuration file has to be modified accordingly.
This process involves creating a new metadata set, then reattaching the resource:
# drbdadm create-md <resource>
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New drbd meta data block successfully created.
# drbdadm attach <resource>
Full synchronization of the new hard disk starts instantaneously and automatically. You will be able to monitor the synchronization’s progress using the drbdadm status --verbose command, as with any background synchronization.
7.2.4. Replacing a Failed Disk When Using External Metadata
When using external metadata, the procedure is basically the same. However, DRBD is not able to recognize independently that the hard disk was swapped, therefore an additional step is required.
# drbdadm create-md <resource>
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New drbd meta data block successfully created.
# drbdadm attach <resource>
# drbdadm invalidate <resource>
Be sure to run drbdadm invalidate on the node without good data; this command will cause the local contents to be overwritten with data from the peers, so running this command on the wrong node might lose data!
|
Here, the drbdadm invalidate command triggers synchronization. Again, sync progress may be observed using the drbdadm status --verbose command.
7.3. Dealing with Node Failure
When DRBD detects that its peer node is down (either by true hardware failure or manual intervention), DRBD changes its connection state from Connected to Connecting and waits for the peer node to reappear. The DRBD resource is then said to operate in disconnected mode. In disconnected mode, the resource and its associated block device are fully usable, and may be promoted and demoted as necessary, but no block modifications are being replicated to the peer node. Instead, DRBD stores which blocks are being modified while disconnected, on a per-peer basis.
7.3.1. Dealing with Temporary Secondary Node Failure
If a node that currently has a resource in the secondary role fails temporarily (due to, for example, a memory problem that is subsequently rectified by replacing RAM), no further intervention is necessary — besides the obvious necessity to repair the failed node and bring it back online. When that happens, the two nodes will simply re-establish connectivity upon system start-up. After this, DRBD synchronizes all modifications made on the primary node in the meantime to the secondary node.
At this point, due to the nature of DRBD’s re-synchronization algorithm, the resource is briefly inconsistent on the secondary node. During that short time window, the secondary node cannot switch to the Primary role if the peer is unavailable. Therefore, the period in which your cluster is not redundant consists of the actual secondary node down time, plus the subsequent re-synchronization.
Please note that with DRBD 9 more than two nodes can be connected for each resource. So, for example, in the case of four nodes, a single failing secondary still leaves two other secondaries available for failover.
7.3.2. Dealing with Temporary Primary Node Failure
From DRBD’s standpoint, failure of the primary node is almost identical to a failure of the secondary node. The surviving node detects the peer node’s failure, and switches to disconnected mode. DRBD does not promote the surviving node to the primary role; it is the cluster management application’s responsibility to do so.
When the failed node is repaired and returns to the cluster, it does so in the secondary role. Therefore, as outlined in the previous section, no further manual intervention is necessary. Again, DRBD does not change the resource role back; it is up to the cluster manager to do so (if so configured).
DRBD ensures block device consistency in case of a primary node failure by way of a special mechanism. For a detailed discussion, refer to The Activity Log.
7.3.3. Dealing with Permanent Node Failure
If a node suffers an unrecoverable problem or permanent destruction, complete the following steps:
-
Replace the failed hardware with hardware of similar performance and disk capacity.
Replacing a failed node with one with worse performance characteristics is possible, but not recommended. Replacing a failed node with one with less disk capacity is not supported, and will cause DRBD to refuse to connect to the replaced node[10].
-
Install the base system and applications.
-
Install DRBD and copy /etc/drbd.conf and all of /etc/drbd.d/ from one of the surviving nodes.
-
Follow the steps outlined in Configuring DRBD, but stop short of The initial device synchronization.
Manually starting a full device synchronization is not necessary at this point. The synchronization will commence automatically upon connection to the surviving primary or secondary node(s), or both.
7.4. Manual Split Brain Recovery
DRBD detects split brain at the time connectivity becomes available again and the peer nodes exchange the initial DRBD protocol handshake. If DRBD detects that both nodes are (or were at some point, while disconnected) in the primary role, it immediately tears down the replication connection. The tell-tale sign of this is a message like the following appearing in the system log:
Split-Brain detected, dropping connection!
After split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in Connecting (if the peer tore down the connection before the other node had a chance to detect split brain).
At this point, unless you configured DRBD to automatically recover from split brain, you must manually intervene by selecting one node whose modifications will be discarded (this node is referred to as the split brain victim). This intervention is made with the following commands:
# drbdadm disconnect <resource>
# drbdadm secondary <resource>
# drbdadm connect --discard-my-data <resource>
On the other node (the split brain survivor), if its connection state is also StandAlone, you would enter:
# drbdadm disconnect <resource>
# drbdadm connect <resource>
You may omit this step if the node is already in the Connecting state; it will then reconnect automatically.
Upon connection, your split brain victim immediately changes its connection state to SyncTarget, and gets its modifications overwritten by the other node(s).
The split brain victim is not subjected to a full device synchronization. Instead, it has its local modifications rolled back, and any modifications made on the split brain survivor(s) propagate to the victim.
After re-synchronization has completed, the split brain is considered resolved and the nodes form a fully consistent, redundant replicated storage system again.
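The manual recovery steps from this section can be collected into small helper functions. This is only a sketch: the function names are hypothetical, and the `drbdadm` invocations are exactly those shown earlier in this section. Ignoring the exit status of `drbdadm disconnect` is an assumption, made in case the command reports an error for a resource that is already StandAlone.

```shell
#!/bin/sh
# split_brain_logged: reads log text on stdin and succeeds if it contains
# the message that DRBD writes when it detects split brain.
split_brain_logged() {
    grep -q 'Split-Brain detected'
}

# split_brain_victim RESOURCE
# Run ONLY on the node whose modifications are to be discarded.
split_brain_victim() {
    res=$1
    # Exit status deliberately ignored (assumption, see the text above).
    drbdadm disconnect "$res"
    drbdadm secondary "$res" &&
    drbdadm connect --discard-my-data "$res"
}

# split_brain_survivor RESOURCE
# Run on the surviving node, and only if it is also StandAlone.
split_brain_survivor() {
    res=$1
    drbdadm disconnect "$res"
    drbdadm connect "$res"
}
```

On a systemd-based host, `journalctl -k | split_brain_logged && echo "split brain detected"` is one way to feed the kernel log to the first function. Keeping the victim and survivor steps in separate functions is intentional, because running the victim steps on the wrong node discards that node's data.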
7.5. Recovering a Primary Node that Lost Quorum
The following instructions apply to cases where the DRBD on-loss-of-quorum action has been set to suspend I/O operations. In cases where the action has been set to generate I/O errors, the instructions are unnecessary.
The DRBD administration tool, drbdadm, includes a force secondary option, secondary --force.
If DRBD quorum was configured to suspend DRBD resource I/O operations upon loss of quorum, the force secondary option will allow you to gracefully recover the node that lost quorum and reintegrate it with the other nodes.
Requirements:
-
DRBD version 9.1.7 or newer
-
drbd-utils version 9.21 or newer
You can use the command drbdadm secondary --force <all|resource_name> to demote a primary node to secondary, in cases where you are trying to recover a primary node that lost quorum. The argument to this command can be either a single DRBD resource name or all to demote the node to a secondary role for all its DRBD resources.
By using this command on the primary node that lost quorum with suspended I/O operations, all the suspended I/O requests and newly submitted I/O requests will terminate with I/O errors. You can then usually unmount the file system and reconnect the node to the other nodes in your cluster. An edge case is a process that holds the file system open but does not do any I/O and just sits idle. Before unmounting can succeed, you need to remove such processes, either manually or with the help of external tools such as fuser -k, or the OCF file system resource agent in clustered setups.
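The demote-and-unmount part of this procedure could be scripted along the following lines. This is a hedged sketch: the function name and its arguments are hypothetical, `fuser -k -m` is used here to kill processes that still hold the mounted file system open, and reconnecting the node afterward is left out on purpose.

```shell
#!/bin/sh
# recover_lost_quorum RESOURCE MOUNTPOINT
# Hypothetical helper for a primary node that lost quorum and has
# suspended I/O operations.
recover_lost_quorum() {
    res=$1
    mnt=$2
    # Suspended and newly submitted I/O requests now end with I/O errors.
    drbdadm secondary --force "$res" || return 1
    # Idle openers that do no I/O keep the mount busy; kill them if needed.
    if ! umount "$mnt"; then
        fuser -k -m "$mnt"
        umount "$mnt" || return 1
    fi
}
```

A call such as `recover_lost_quorum <resource> /mnt/data` (the mount point is an example) stops at the first failing step, so that you can inspect the node before going further.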
Along with the DRBD administration tool’s force secondary option, you can also add the on-suspended-primary-outdated option to a DRBD resource configuration file and set it to the keyword value force-secondary. You will also need to add the resource role conflict (rr-conflict) option to the DRBD resource configuration file’s net section, and set it to retry-connect. This enables DRBD to automatically recover a primary node that loses quorum with suspended I/O operations. With these options configured, when such a node connects to a cluster partition that has a more recent data set, DRBD automatically demotes the primary node that lost quorum and has suspended I/O operations. Additional configurations, for example in a handlers section of the resource configuration file, as well as additional configurations within a cluster manager, may also be necessary to complete a fully automatic recovery setup.
Settings within a DRBD resource configuration file’s net and options sections that cover this scenario could look like this:
resource <resource_name> {
  net {
    rr-conflict retry-connect;
    [...]
  }
  options {
    quorum majority; # or explicit value
    on-no-quorum suspend-io;
    on-no-data-accessible suspend-io;
    on-suspended-primary-outdated force-secondary;
    [...]
  }
  [...]
}
DRBD-enabled Applications
8. DRBD Reactor
DRBD Reactor is a daemon that monitors DRBD events and reacts to them. DRBD Reactor has various potential uses, from monitoring DRBD resources and metrics to creating failover clusters that provide highly available services, which you would otherwise need to configure using complex cluster managers.
8.1. Installing DRBD Reactor
DRBD Reactor can be installed from source files found within the project’s GitHub repository. See the instructions there for details and any prerequisites.
Alternatively, LINBIT customers can install DRBD Reactor from prebuilt packages, available from LINBIT’s drbd-9 packages repository.
Once installed, you can verify DRBD Reactor’s version number by using the drbd-reactor --version command.
8.2. DRBD Reactor’s Components
Because DRBD Reactor has many different uses, it was split into two components: a core component and a plugin component.
8.2.1. DRBD Reactor Core
DRBD Reactor’s core component is responsible for collecting DRBD events, preparing them, and sending them to the DRBD Reactor plugins.
The core can be reloaded with an all new or an additional, updated configuration. It can stop plugin instances no longer required and start new plugin threads without losing DRBD events. Last but not least, the core has to ensure that plugins receive an initial and complete DRBD state.
8.2.2. DRBD Reactor Plugins
Plugins provide DRBD Reactor with its functionality and there are different plugins for different uses. A plugin receives messages from the core component and acts upon DRBD resources based on the message content and according to the plugin’s type and configuration.
Plugins can be instantiated multiple times, so there can be multiple instances of every plugin type. So, for example, numerous plugin instances could provide high-availability in a cluster, one per DRBD resource.
8.2.3. The Promoter Plugin
The promoter plugin is arguably DRBD Reactor’s most important and useful feature. You can use it to create failover clusters hosting highly available services more easily than using other more complex cluster resource managers (CRMs). If you want to get started quickly, you can finish reading this section, then skip to Configuring the Promoter Plugin. You can then try the instructions in the Using DRBD Reactor’s Promoter Plugin to Create a Highly Available File System Mount section for an example exercise.
The promoter plugin monitors events on DRBD resources and executes systemd units. This plugin allows DRBD Reactor to provide failover functionality to a cluster to create high-availability deployments. You can use DRBD Reactor and its promoter plugin as a replacement for other CRMs, such as Pacemaker, in many scenarios where its lightness and its configuration simplicity offer advantages.
For example, you can use the promoter plugin to configure fully automatic recovery of isolated primary nodes. Furthermore, there is no need for a separate communication layer (such as Corosync), because DRBD and DRBD Reactor (used as the CRM) will always agree on the quorum status of nodes.
A disadvantage to the promoter plugin when compared to a CRM such as Pacemaker is that it is not possible to create order constraints that are independent of colocations. For example, if a web service and a database run on different nodes, Pacemaker can constrain the web service to start after the database. DRBD Reactor and its promoter plugin cannot.
How the Promoter Plugin Works
The promoter plugin’s main function is this: if a DRBD device can be promoted, promote it to Primary and start a set of user-defined services. This could be a series of steps, such as:
-
Promote the DRBD device.
-
Mount the device to a mount point.
-
Start a database service that uses data located at the mount point.
If a resource loses quorum, DRBD Reactor stops these services so that another node that still has quorum (or the node that lost quorum when it has quorum again) can start the services.
The promoter plugin also supports Open Cluster Framework (OCF) resource agents and failure actions such as rebooting a node if a resource fails to demote, so that the resource can promote on another node.
8.2.4. The User Mode Helper (UMH) Plugin
Using this plugin and its domain specific language (DSL), you can execute a script if an event you define occurs. For example, you can run a script that sends a Slack message whenever a DRBD resource loses connection.
This functionality has existed before in DRBD with “user-defined helper scripts” in “kernel space”. However, DRBD Reactor, including the UMH plugin, can be executed in “user space”. This allows for easier container deployments and use with “read-only” host file systems such as those found within container distributions.
Using UMH plugins also provides a benefit beyond what was previously possible using user-defined helper scripts: Now you can define your own rules for all the events that are possible for a DRBD resource. You are no longer limited to only the few events for which the kernel has event handlers.
UMH plugin scripts can be of two types:
-
User-defined filters. These are “one-shot” UMH scripts where an event happens that triggers the script.
-
Kernel called helper replacements. This type of script is currently under development. These are UMH scripts that require communication to and from the kernel. An event triggers the script but an action within the script requires the kernel to communicate back to the script so that the script can take a next action, based on the failure or success of the kernel’s action. An example of such a script would be a before-resync-target activated script.
8.2.5. The Prometheus Monitoring Plugin
This plugin provides a Prometheus compatible endpoint that exposes various DRBD metrics, including out-of-sync bytes, resource roles (for example, Primary), and connection states (for example, Connected). This information can then be used in every monitoring solution that supports Prometheus endpoints. The full set of metrics and an example Grafana dashboard are available at the DRBD Reactor GitHub repository.
8.2.6. The AgentX Plugin for SNMP Monitoring
This plugin acts as an AgentX subagent for SNMP to expose various DRBD metrics, for example, to monitor DRBD resources via SNMP. AgentX is a standardized protocol that can be used between the SNMP daemon and a subagent, such as the AgentX plugin in DRBD Reactor.
The DRBD metrics that this plugin exposes to the SNMP daemon are shown in the project’s source code repository.
8.3. Configuring DRBD Reactor
Before you can run DRBD Reactor, you must configure it. Global configurations are made within a main TOML configuration file, which should be created here: /etc/drbd-reactor.toml. The file has to be a valid TOML (https://toml.io) file. Plugin configurations should be made within snippet files that can be placed into the default DRBD Reactor snippets directory, /etc/drbd-reactor.d, or into another directory if specified in the main configuration file. An example configuration file can be found in the example directory of the DRBD Reactor GitHub repository.
For documentation purposes only, the example configuration file mentioned above contains example plugin configurations. However, for deployment, plugin configurations should always be made within snippet files.
8.3.1. Configuring DRBD Reactor’s Core
DRBD Reactor’s core configuration file consists of global settings and log level settings.
Global settings include specifying a snippets directory, specifying a statistics update polling time period, as well as specifying a path to a log file. You can also set the log level within the configuration file to one of: trace, debug, info, warn, error, off. “Info” is the default log level.
See the drbd-reactor.toml man page for the syntax of these settings.
8.3.2. Configuring DRBD Reactor Plugins
You configure DRBD Reactor plugins by editing TOML formatted snippet files. Every plugin can specify an ID (id) in its configuration section. On a DRBD Reactor daemon reload, started plugins that are still present in the new configuration keep running. Plugins without an ID get stopped and restarted if still present in the new configuration.
For plugins without an ID, every DRBD Reactor service reload is a restart.
8.3.3. Configuring the Promoter Plugin
You will typically have one snippet file for each DRBD resource that you want DRBD Reactor and the promoter plugin to watch and manage.
Here is an example promoter plugin configuration snippet:
[[promoter]]
[promoter.resources.my_drbd_resource] (1)
dependencies-as = "Requires" (2)
target-as = "Requires" (3)
start = ["path-to-my-file-system-mount.mount", "foo.service"] (4)
on-drbd-demote-failure = "reboot" (5)
secondary-force = true (6)
preferred-nodes = ["nodeA", "nodeB"] (7)
1 | “my_drbd_resource” specifies the name of the DRBD resource that DRBD Reactor and the promoter plugin should watch and manage. |
2 | Specifies the systemd dependency type to generate inter-service dependencies as. |
3 | Specifies the systemd dependency type to generate service dependencies in the final target unit. |
4 | start specifies what should be started when the watched DRBD resource is promotable. In this example, the promoter plugin would start a file system mount unit and a service unit. |
5 | Specifies the action to take if a DRBD resource fails to demote, for example, after a loss of quorum event. In such a case, an action should be taken on the node that fails to demote that will trigger some “self-fencing” of the node and cause another node to promote. Actions can be one of: reboot, reboot-force, reboot-immediate, poweroff, poweroff-force, poweroff-immediate, exit, exit-force. |
6 | If a node loses quorum, DRBD Reactor will try to demote the node to a secondary role. If the resource was configured to suspend I/O operations upon loss of quorum, this setting specifies whether or not to demote the node to a secondary role using drbdadm’s force secondary feature. See the Recovering a Primary Node that Lost Quorum section of the DRBD User’s Guide for more details. “true” is the default option if this setting is not specified. It is specified here for illustrative purposes. |
7 | If set, resources are started on the preferred nodes, in the specified order, if possible. |
Specifying a Promoter Start List Service String Spanning Multiple Lines
For formatting or readability reasons, it is possible to split a long service string across multiple lines within a promoter plugin snippet file’s start list of services. You can do this by using TOML syntax for multi-line basic strings. In the following example, the first and third service strings in a promoter plugin’s start list are split across multiple lines. A backslash (\) at the end of a line within a multi-line basic string ensures that a newline character is not inserted between lines within the string.
[...]
start = [
"""
ocf:heartbeat:Filesystem fs_mysql device=/dev/drbd1001 \
directory=/var/lib/mysql fstype=ext4 run_fsck=no""",
"mariadb.service",
"""ocf:heartbeat:IPaddr2 db_virtip ip=192.168.222.65 \
cidr_netmask=24 iflabel=virtualip"""
]
[...]
You can also use this technique to split up long strings within other plugin snippet files.
Configuring Resource Freezing
Starting with DRBD Reactor version 0.9.0, you can configure the promoter plugin to “freeze” a resource that DRBD Reactor is controlling, rather than stopping it when a currently active node loses quorum. DRBD Reactor can then “thaw” the resource when the node regains quorum and becomes active, rather than having to restart the resource if it was stopped.
While in most cases the default stop and start behavior will be preferred, the freeze and thaw configuration could be useful for a resource that takes a long time to start, for example, a resource that includes services such as a large database. If a Primary node loses quorum in such a cluster, and the remaining nodes are unable to form a partition with quorum, freezing the resource could be useful, especially if the Primary node’s loss of quorum was momentary, for example due to a brief network issue. When the formerly Primary node with a frozen resource reconnects with its peer nodes, the node would again become Primary and DRBD Reactor would thaw the resource. The result of this behavior could be that the resource is again available in seconds, rather than minutes, because the resource did not have to start from a stopped state; it only had to resume from a frozen one.
Requirements:
Before configuring the promoter plugin’s freeze feature for a resource, you will need:
-
A system that uses cgroup v2, implementing unified cgroups. You can verify this by the presence of /sys/fs/cgroup/cgroup.controllers on your system. If this is not present, and your kernel supports it, you should be able to add the kernel command line argument systemd.unified_cgroup_hierarchy=1 to enable this feature.
This should only be relevant for RHEL 8, Ubuntu 20.04, and earlier versions.
-
The following DRBD options configured for the resource:
-
on-no-quorum set to suspend-io;
-
on-no-data-accessible set to suspend-io;
-
on-suspended-primary-outdated set to force-secondary;
-
rr-conflict (net option) set to retry-connect.
-
A resource that can “tolerate” freezing and thawing. You can test how your resource (and any applications that rely on the resource) respond to freezing and thawing by using the systemctl freeze <systemd_unit>, and the systemctl thaw <systemd_unit> commands. Here you specify the systemd unit or units that correspond to the start list of services within the promoter plugin’s configuration. You can use these commands to test how your applications behave, after services that they depend on are frozen and thawed.
If you are unsure whether your resource and applications will tolerate freezing, then it is safer to keep the default stop and start behavior.
To configure resource freezing, add the following line to your DRBD Reactor resource’s promoter plugin snippet file:
on-quorum-loss = "freeze"
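Before relying on this setting, you might want to script the cgroup v2 check and the freeze and thaw test described in the requirements. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the optional directory argument exists only so that the check can be pointed at a different cgroup mount, and the pause length is arbitrary.

```shell
#!/bin/sh
# has_cgroup_v2 [CGROUP_ROOT]
# Succeeds if the unified cgroup hierarchy is present, judged by the
# existence of the cgroup.controllers file.
has_cgroup_v2() {
    [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}

# freeze_thaw_test UNIT [SECONDS]
# Freezes a systemd unit, waits, then thaws it again, so that you can
# watch how dependent applications react in the meantime.
freeze_thaw_test() {
    unit=$1
    pause=${2:-5}
    has_cgroup_v2 || { echo "cgroup v2 not detected" >&2; return 1; }
    systemctl freeze "$unit" || return 1
    sleep "$pause"
    systemctl thaw "$unit"
}
```

For example, `freeze_thaw_test mariadb.service 10` would keep the unit frozen for ten seconds. Run such a test only on a system where a short service stall is acceptable.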
Using OCF Resource Agents with the Promoter Plugin
You can also configure the promoter plugin to use OCF resource agents in the start list of services.
If you have a LINBIT customer or evaluation account, you can install the resource-agents package available in LINBIT’s drbd-9 package repository to install a suite of open source resource agent scripts, including the “Filesystem” OCF resource agent.
The syntax for specifying an OCF resource agent as a service within a start list is ocf:$vendor:$agent instance-id [key=value key=value …]. Here, instance-id is user-defined and key=value pairs, if specified, are passed as environment variables to the created systemd unit file. For example:
[[promoter]]
[...]
start = ["ocf:heartbeat:IPaddr2 ip_mysql ip=10.43.7.223 cidr_netmask=16"]
[...]
The promoter plugin expects OCF resource agents in the /usr/lib/ocf/resource.d/ directory.
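To check that an agent named in a start list is actually installed, you can derive its expected path from the service string. The sketch below assumes the usual OCF layout of one subdirectory per vendor under the resource agent directory, for example heartbeat/IPaddr2. The function name is hypothetical, and the optional second argument only exists to override the base directory.

```shell
#!/bin/sh
# ocf_agent_path SERVICE_STRING [OCF_ROOT]
# Prints the path where the agent named in an "ocf:vendor:agent ..."
# start list entry would be expected. Returns 1 for non-OCF entries.
ocf_agent_path() {
    root=${2:-/usr/lib/ocf/resource.d}
    spec=${1%% *}                  # first word: ocf:vendor:agent
    case $spec in
        ocf:*:*) ;;
        *) return 1 ;;
    esac
    rest=${spec#ocf:}              # vendor:agent
    vendor=${rest%%:*}
    agent=${rest#*:}
    printf '%s/%s/%s\n' "$root" "$vendor" "$agent"
}
```

A possible use is `[ -x "$(ocf_agent_path 'ocf:heartbeat:IPaddr2 ip_mysql ip=10.43.7.223 cidr_netmask=16')" ] || echo "agent missing"`, run on each node before enabling the promoter snippet.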
When to Use systemd Mount Units and OCF Filesystem Resource Agents
Almost all scenarios in which you might use DRBD Reactor and its promoter plugin will likely involve a file system mount. If your use case involves a promoter start list of services with other services or applications besides a file system mount, then you should use a systemd mount unit to handle the file system mounting.
However, you should not use a systemd file system mount unit if a file system mount point is the end goal, that is, it would be the last service in your promoter plugin start list of services. Instead, use an OCF Filesystem resource agent to handle mounting and unmounting the file system.
In this case, using an OCF resource agent is preferred because the resource agent will be able to escalate the demotion of nodes, by using kill actions and other various signals against processes that might be holding the mount point open. For example, there could be a user running an application against a file in the file system that systemd would not know about. In that case, systemd would not be able to unmount the file system and the promoter plugin would not be able to demote the node.
You can find more information in the DRBD Reactor GitHub documentation.
8.3.4. Configuring the User Mode Helper (UMH) Plugin
Configuration for this plugin consists of:
-
Rule type
-
Command or script to execute
-
User-defined environment variables (optional)
-
Filters based on DRBD resource name, event type, or state changes
There are four different DRBD types a rule can be defined for: resource, device, peerdevice, or connection.
For each rule type, you can configure a command or script to execute using sh -c as well as any user-defined environment variables. User-defined environment variables are in addition to the commonly set ones:
-
HOME “/”
-
TERM “Linux”
-
PATH “/sbin:/usr/sbin:/bin:/usr/bin”
You can also filter UMH rule types by DRBD resource name or event type (exists, create, destroy, or change).
Finally, you can filter the plugin’s action based on DRBD state changes. Filters should be based upon both the old and the new (current) DRBD state, that are reported to the plugin, because you want the plugin to react to changes. This is only possible if two states, old and new, are filtered for, otherwise the plugin might trigger randomly. For example, if you only specified a new (current) DRBD role as a DRBD state to filter for, the plugin might trigger even when the new role is the same as the old DRBD role.
Here is an example UMH plugin configuration snippet for a resource rule:
[[umh]]
[[umh.resource]]
command = "slack.sh $DRBD_RES_NAME on $(uname -n) from $DRBD_OLD_ROLE to $DRBD_NEW_ROLE"
event-type = "Change"
resource-name = "my-resource"
old.role = { operator = "NotEquals", value = "Primary" }
new.role = "Primary"
This example UMH plugin configuration is based on change event messages received from DRBD Reactor’s daemon for the DRBD resource specified by the resource-name value my-resource.
If the resource’s old role was not Primary and its new (current) role is Primary, then a script named slack.sh runs with the arguments that follow. As the full path is not specified, the script needs to reside within the commonly set PATH environment variable (/sbin:/usr/sbin:/bin:/usr/bin) of the host machine (or container if run that way). Presumably, the script sends a message to a Slack channel informing of the resource role change. Variables in the command string are replaced when the command runs, based on values specified elsewhere in the plugin’s configuration. For example, $DRBD_RES_NAME is replaced by the value specified by resource-name.
The example configuration above uses the operator “NotEquals” to test that the old.role value was not “Primary”. If you do not specify an operator, then the default operator is “Equals”, as in the new.role = "Primary" filter in the example configuration.
There are more rules, fields, filter types, and variables that you can specify in your UMH plugin configurations. See the UMH documentation page in the DRBD Reactor GitHub repository for more details, explanations, examples, and caveats.
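The guide does not show the slack.sh script itself. The following is a hypothetical version, given only to illustrate what such a helper could look like. It assumes a Slack incoming webhook whose URL is supplied in an environment variable named SLACK_WEBHOOK_URL (an invented name, which could be set through the rule's user-defined environment variables), and it assumes that curl is available in the restricted PATH.

```shell
#!/bin/sh
# Hypothetical slack.sh: posts its arguments as one message to a Slack
# incoming webhook. SLACK_WEBHOOK_URL is an assumed variable name.

# json_text MESSAGE
# Wraps MESSAGE in a {"text": ...} JSON payload, escaping backslashes
# and double quotes. Other control characters are not handled.
json_text() {
    escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"text":"%s"}\n' "$escaped"
}

# Post only when called with arguments and a webhook URL is configured.
if [ "$#" -gt 0 ] && [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    curl -fsS -X POST -H 'Content-Type: application/json' \
        -d "$(json_text "DRBD role change: $*")" "$SLACK_WEBHOOK_URL"
fi
```

With the example rule above, the script would receive arguments of the form `my-resource on <hostname> from Secondary to Primary` and send them as a single line of text.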
8.3.5. Configuring the Prometheus Plugin
This plugin provides a Prometheus compatible HTTP endpoint serving DRBD monitoring metrics, such as the DRBD connection state, whether or not the DRBD device has quorum, number of bytes out of sync, indication of TCP send buffer congestion, and many more. The drbd-reactor.prometheus man page has a full list of metrics and more details.
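For a quick check without a full monitoring stack, you can read the endpoint's text output directly. The helper below is a sketch that sums all samples of one metric from Prometheus text exposition format on standard input. It assumes samples of the form `name{labels} value` without a trailing timestamp. The function name is hypothetical, and the port and metric name in the usage note are assumptions to verify against the man page.

```shell
#!/bin/sh
# metric_sum NAME
# Reads Prometheus text exposition format on stdin and prints the sum
# of all samples of metric NAME, with or without labels.
metric_sum() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP and TYPE comments
        NF >= 2 {
            metric = $1
            i = index(metric, "{")         # drop the {label="..."} part
            if (i > 0) metric = substr(metric, 1, i - 1)
            if (metric == name) total += $NF
        }
        END { printf "%.0f\n", total }
    '
}
```

Assuming the plugin listens on port 9942 and exposes a metric named drbd_peerdevice_outofsync_bytes, `curl -s http://localhost:9942/metrics | metric_sum drbd_peerdevice_outofsync_bytes` would print the total number of out-of-sync bytes across all resources and peers.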
8.3.6. Configuring the AgentX Plugin for SNMP Monitoring
Configuring the AgentX plugin involves installing an SNMP management information base (MIB) that defines the DRBD metrics that will be exposed, configuring the SNMP daemon, and editing a DRBD Reactor configuration snippet file for the AgentX plugin.
You will need to complete the following setup steps on all your DRBD Reactor nodes.
Prerequisites
Before configuring this plugin to expose various DRBD metrics to an SNMP daemon, you will need to install the following packages, if they are not already installed.
For RPM-based systems:
# dnf -y install net-snmp net-snmp-utils
For DEB-based systems:
# apt -y install snmp snmpd
If you encounter errors related to missing MIBs when using SNMP commands against the LINBIT MIB, you will have to download the missing MIBs. You can do this manually or else install the snmp-mibs-downloader DEB package.
AgentX Firewall Considerations
If you are using a firewall service, you will need to allow TCP traffic via port 705 for the AgentX protocol.
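How you open the port depends on the firewall front end in use. As a hedged sketch, the function below handles two common cases, firewalld (firewall-cmd) and UFW (ufw). The function name is hypothetical, and other firewall tools are not covered.

```shell
#!/bin/sh
# open_agentx_port
# Allows TCP port 705 using whichever supported firewall front end is
# found first. Returns 1 if neither firewall-cmd nor ufw is available.
open_agentx_port() {
    if command -v firewall-cmd >/dev/null 2>&1; then
        firewall-cmd --permanent --add-port=705/tcp &&
        firewall-cmd --reload
    elif command -v ufw >/dev/null 2>&1; then
        ufw allow 705/tcp
    else
        echo "no supported firewall front end found" >&2
        return 1
    fi
}
```

Run `open_agentx_port` as root on each node where the SNMP daemon and the AgentX plugin need to communicate through the firewall.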
Installing the LINBIT DRBD Management Information Base
To use the AgentX plugin, download the LINBIT DRBD MIB to /usr/share/snmp/mibs.
# curl -L https://github.com/LINBIT/drbd-reactor/raw/master/example/LINBIT-DRBD-MIB.mib \
    -o /usr/share/snmp/mibs/LINBIT-DRBD-MIB.mib
Configuring the SNMP Daemon
To configure the SNMP service daemon, add the following lines to its configuration file (/etc/snmp/snmpd.conf):
# add LINBIT ID to the system view and enable agentx
view systemview included .1.3.6.1.4.1.23302
master agentx
agentXSocket tcp:127.0.0.1:705
Verify that the view name that you use matches a view name that is configured appropriately in the SNMP configuration file. The example above shows systemview as the view name used in a RHEL 8 system. For Ubuntu, the view name could be different, for example, in Ubuntu 22.04 it is systemonly.
Next, enable and start the service (or restart the service if it was already enabled and running):
# systemctl enable --now snmpd.service
Editing the AgentX Plugin Configuration Snippet File
The AgentX plugin needs only minimal configuration in a DRBD Reactor snippet file. Edit the configuration snippet file by entering the following command:
# drbd-reactorctl edit -t agentx agentx
Then add the following lines:
[[agentx]]
address = "localhost:705"
cache-max = 60 # seconds
agent-timeout = 60 # seconds snmpd waits for an answer
peer-states = true # include peer connection and disk states
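After changing a snippet file by a method that does not reload DRBD Reactor for you (for example, copying the file from another node), reload the service:
# systemctl reload drbd-reactor.service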
Verifying the AgentX Plugin Operation
Before verifying the AgentX plugin operation, first verify that the SNMP service exposes a standard, preinstalled MIB, by entering the following command:
# snmpwalk -Os -c public -v 2c localhost iso.3.6.1.2.1.1.1
sysDescr.0 = STRING: Linux linstor-1 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023 x86_64
Next, verify that the AgentX plugin is shown in the output of a drbd-reactorctl status command.
/etc/drbd-reactor.d/agentx.toml:
AgentX: connecting to main agent at localhost:705
[...]
Next, show the LINBIT MIB table structure by entering the following command:
# snmptranslate -Tp -IR -mALL linbit
Finally, you can use an snmptable command to show a table of the values held in the MIB, particular to your current DRBD setup and resources. The example command below starts showing the values for your DRBD resources at the enterprises.linbit.1.2 (enterprises.linbit.drbdData.drbdTable) object identifier (OID) within the LINBIT MIB.
# snmptable -m ALL -v 2c -c public localhost enterprises.linbit.1.2 | less -S
Using the AgentX Plugin With LINSTOR
If you are using DRBD Reactor and its AgentX plugin to work with LINSTOR®-created DRBD resources, note that these DRBD resources will start from minor number 1000, rather than 1. So, for example, to get the DRBD resource name of the first LINSTOR-created resource on a particular node, enter the following command:
# snmpget -m ALL -v 2c -c public localhost .1.3.6.1.4.1.23302.1.2.1.2.1000
LINBIT-DRBD-MIB::ResourceName.1000 = STRING: linstor_db
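If you query resource names for several minor numbers, you can build the OID from the minor number instead of typing it each time. This sketch reuses the OID prefix and the snmpget options from the example above. The function names are hypothetical, and the community string public and the host localhost are carried over from that example.

```shell
#!/bin/sh
# drbd_resource_name_oid MINOR
# Prints the numeric OID of the resource name entry for a DRBD minor
# number, using the prefix from the example above.
drbd_resource_name_oid() {
    printf '.1.3.6.1.4.1.23302.1.2.1.2.%s\n' "$1"
}

# drbd_resource_name MINOR [HOST]
# Queries the SNMP daemon for the resource name of the given minor.
drbd_resource_name() {
    snmpget -m ALL -v 2c -c public "${2:-localhost}" \
        "$(drbd_resource_name_oid "$1")"
}
```

For example, `drbd_resource_name 1001` would query the second LINSTOR-created resource on the local node, provided that a resource with that minor number exists.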
8.4. Using the DRBD Reactor CLI Utility
You can use the DRBD Reactor CLI utility, drbd-reactorctl, to control the DRBD Reactor daemon and its plugins.
This utility only operates on plugin snippets. Any existing plugin configurations in the main configuration file (not advised nor supported) should be moved to snippet files within the snippets directory.
With the drbd-reactorctl utility, you can:
-
Get the status of the DRBD Reactor daemon and enabled plugins, by using the drbd-reactorctl status command.
-
Edit an existing or create a new plugin configuration, by using the drbd-reactorctl edit -t <plugin_type> <plugin_file> command.
-
Display the TOML configuration of a given plugin, by using the drbd-reactorctl cat <plugin_file> command.
-
Enable or disable a plugin, by using the drbd-reactorctl enable|disable <plugin_file> command.
-
Evict a promoter plugin resource from the node, by using the drbd-reactorctl evict <plugin_file> command.
-
Restart specified plugins (or the DRBD Reactor daemon, if no plugins specified) by using the drbd-reactorctl restart <plugin_file> command.
-
Remove an existing plugin and restart the daemon, by using the drbd-reactorctl rm <plugin_file> command.
-
List the activated plugins, or optionally list disabled plugins, by using the drbd-reactorctl ls [--disabled] command.
For greater control of some of the above actions, there are additional options available. The drbd-reactorctl man page has more details and syntax information.
8.4.1. Pacemaker CRM Shell Commands and Their DRBD Reactor Client Equivalents
The following table shows some common CRM tasks and the corresponding Pacemaker CRM shell and the equivalent DRBD Reactor client commands.
CRM task | Pacemaker CRM shell command | DRBD Reactor client command |
---|---|---|
Get status | | drbd-reactorctl status |
Migrate away | crm resource migrate | drbd-reactorctl evict |
Unmigrate | crm resource unmigrate | Unnecessary |
A DRBD Reactor client command that is equivalent to crm resource unmigrate is unnecessary because DRBD Reactor’s promoter plugin evicts a DRBD resource in the moment, but it does not prevent the resource from failing back to the node it was evicted from later, should the situation arise. In contrast, the CRM shell migrate command inserts a permanent constraint into the cluster information base (CIB) that prevents the resource from running on the node the command is run on. The CRM shell unmigrate command is a manual intervention that removes the constraint and allows the resource to fail back to the node the command is run on. A forgotten unmigrate command can have dire consequences the next time the node might be needed to host the resource during an HA event.
If you need to prevent failback to a particular node, you can evict the resource by using the DRBD Reactor client with the evict --keep-masked command and flag. This prevents failback until the node reboots and the flag gets removed. You can remove the flag sooner than a reboot would, by using the drbd-reactorctl evict --unmask command. This command would be the equivalent of CRM shell’s unmigrate command.
8.5. Using DRBD Reactor’s Promoter Plugin to Create a Highly Available File System Mount
In this example, you will use DRBD Reactor and the promoter plugin to create a highly available file system mount within a cluster.
Prerequisites:
-
A directory /mnt/test created on all of your cluster nodes
-
A configured DRBD resource named ha-mount that is backed by a DRBD device on all nodes. The configuration examples that follow use /dev/drbd1000.
-
The Cluster Labs “Filesystem” OCF resource agent, available through Cluster Labs’ resource-agents GitHub repository, should be present in the /usr/lib/ocf/resource.d/heartbeat directory
If you have a LINBIT customer or evaluation account, you can install the resource-agents package available in LINBIT’s drbd-9 package repository to install a suite of open source resource agent scripts, including the “Filesystem” OCF resource agent.
The DRBD resource, ha-mount, should have the following settings configured in its DRBD resource configuration file:
resource ha-mount {
  options {
    auto-promote no;
    quorum majority;
    on-no-quorum suspend-io;
    on-no-data-accessible suspend-io;
    [...]
  }
  [...]
}
First, make one of your nodes Primary for the ha-mount resource.
# drbdadm primary ha-mount
Then create a file system on the DRBD backed device. The ext4 file system is used in this example.
# mkfs.ext4 /dev/drbd1000
Make the node Secondary again, because once the remaining configuration is in place, DRBD Reactor and the promoter plugin will control promoting nodes.
# drbdadm secondary ha-mount
On all nodes that should be able to mount the DRBD backed device, create a systemd unit file:
# cat << EOF > /etc/systemd/system/mnt-test.mount
[Unit]
Description=Mount /dev/drbd1000 to /mnt/test

[Mount]
What=/dev/drbd1000
Where=/mnt/test
Type=ext4
EOF
The systemd unit file name must match the mount location value given by the “Where=” directive, using systemd escape logic. In the example above, mnt-test.mount matches the mount location given by Where=/mnt/test. You can use the command systemd-escape -p --suffix=mount /my/mount/point to convert your mount point to a systemd unit file name.
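To illustrate the escape logic that the unit file name has to follow, here is a simplified sketch in POSIX shell. The function name mount_unit_name is hypothetical. It only covers the common case: it trims leading and trailing slashes, turns the remaining slashes into dashes, and hex-escapes other special characters. It does not collapse repeated slashes, escape a leading dot, or handle non-ASCII characters, so treat systemd-escape as the authoritative tool.

```shell
# Simplified sketch of systemd path escaping for mount unit names.
# Assumption: ASCII paths without repeated inner slashes or a leading dot.
mount_unit_name() {
    path=$1
    # Trim leading and trailing slashes.
    while [ "${path#/}" != "$path" ]; do path=${path#/}; done
    while [ "${path%/}" != "$path" ]; do path=${path%/}; done
    # The root directory maps to the special name "-".
    if [ -z "$path" ]; then
        printf '%s\n' '-.mount'
        return
    fi
    out=
    rest=$path
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}   # first character of the remainder
        rest=${rest#?}
        case $c in
            /) out="$out-" ;;                       # path separator becomes "-"
            [A-Za-z0-9:_.]) out="$out$c" ;;         # kept as-is
            *) out="$out$(printf '\\x%02x' "'$c")" ;;  # everything else as \xNN
        esac
    done
    printf '%s.mount\n' "$out"
}

mount_unit_name /mnt/test
```

For the mount point used in this example, the function prints mnt-test.mount. A literal dash in a directory name is escaped as \x2d, which is why a path such as /var/lib/my-data does not map to var-lib-my-data.mount.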
Next, on the same nodes as the previous step, create a configuration file for the DRBD Reactor promoter plugin:
# cat << EOF > /etc/drbd-reactor.d/ha-mount.toml
[[promoter]]
[promoter.resources.ha-mount]
start = [
"""ocf:heartbeat:Filesystem fs_test device=/dev/drbd1000 \
directory=/mnt/test fstype=ext4 run_fsck=no"""
]
on-drbd-demote-failure = "reboot"
EOF
This promoter plugin configuration uses a start list of services that specifies an OCF resource agent for the file system found at your HA mount point. By using this particular resource agent, you can circumvent situations where systemd might not know about certain users and processes that might hold the mount point open and prevent it from unmounting. This could happen if you specified a systemd mount unit for the mount point, for example, start = ["mnt-test.mount"], rather than using the OCF Filesystem resource agent.
To apply the configuration, enable and start the DRBD Reactor service on all nodes. If the DRBD Reactor service is already running, reload it instead.
# systemctl enable drbd-reactor.service --now
Next, verify which cluster node is in the Primary role for the ha-mount resource and has the backing device mounted.
# drbd-reactorctl status ha-mount
Test a simple failover situation on the Primary node by using the DRBD Reactor CLI utility to disable the ha-mount configuration.
# drbd-reactorctl disable --now ha-mount
Run the DRBD Reactor status command again to verify that another node is now in the Primary role and has the file system mounted.
After testing failover, you can enable the configuration on the node you disabled it on earlier.
# drbd-reactorctl enable ha-mount
As a next step, you may want to read the LINSTOR User’s Guide section on creating a highly available LINSTOR cluster. There, DRBD Reactor is used to manage the LINSTOR Controller as a service so that it is highly available within your cluster.
8.6. Configuring DRBD Reactor’s Prometheus Plugin
DRBD Reactor’s Prometheus monitoring plugin acts as a Prometheus compatible endpoint for DRBD resources and exposes various DRBD metrics. You can find a list of the available metrics in the documentation folder in the project’s GitHub repository.
Prerequisites:
-
Prometheus is installed with its service enabled and running.
-
Grafana is installed with its service enabled and running.
To enable the Prometheus plugin, create a simple configuration file snippet on all DRBD Reactor nodes that you are monitoring.
# cat << EOF > /etc/drbd-reactor.d/prometheus.toml
[[prometheus]]
enums = true
address = "0.0.0.0:9942"
EOF
Reload the DRBD Reactor service on all nodes that you are monitoring.
# systemctl reload drbd-reactor.service
Add the following DRBD Reactor monitoring endpoint to your Prometheus configuration file’s scrape_configs section. Replace “node-x” in the targets lines below with either hostnames or IP addresses for your DRBD Reactor monitoring endpoint nodes. Hostnames must be resolvable from your Prometheus monitoring node.
  - job_name: drbd_reactor_endpoint
    static_configs:
      - targets: ['node-0:9942']
        labels:
          instance: 'node-0'
      - targets: ['node-1:9942']
        labels:
          instance: 'node-1'
      - targets: ['node-2:9942']
        labels:
          instance: 'node-2'
      [...]
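For clusters with many nodes, the scrape job for the DRBD Reactor endpoint can be generated rather than typed by hand. The following is a minimal sketch, not part of DRBD Reactor: the function name drbd_reactor_scrape_job is hypothetical, the port 9942 is taken from the prometheus.toml snippet in this section, and the two-space base indentation is an assumption that you may need to adjust to match your own scrape_configs section.

```shell
# Hypothetical helper: print a Prometheus scrape job with one target entry
# per DRBD Reactor node given on the command line.
drbd_reactor_scrape_job() {
    port=9942   # port from the [[prometheus]] address setting in this section
    printf '%s\n' '  - job_name: drbd_reactor_endpoint'
    printf '%s\n' '    static_configs:'
    for node in "$@"; do
        printf "      - targets: ['%s:%s']\n" "$node" "$port"
        printf '%s\n' '        labels:'
        printf "          instance: '%s'\n" "$node"
    done
}

drbd_reactor_scrape_job node-0 node-1 node-2
```

Review the generated text before pasting it into your Prometheus configuration file, because YAML is sensitive to indentation.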
Then, assuming it is already enabled and running, reload the Prometheus service by entering sudo systemctl reload prometheus.service.
Next, you can open your Grafana server’s URL with a web browser. If the Grafana server service is running on the same node as your Prometheus monitoring service, the URL would look like: http://<node_IP_address_or_hostname>:3000.
You can then log in to the Grafana server web UI, add a Prometheus data source, and then add or import a Grafana dashboard that uses your Prometheus data source. An example dashboard is available at the Grafana Labs dashboards marketplace. An example dashboard is also available as a downloadable JSON file at the DRBD Reactor GitHub project site.
9. Integrating DRBD with Pacemaker Clusters
Using DRBD in conjunction with the Pacemaker cluster stack is arguably DRBD’s most frequently found use case. Pacemaker is also one of the applications that make DRBD extremely powerful in a wide variety of usage scenarios.
DRBD can be used in Pacemaker clusters in different ways:
-
DRBD running as a background-service, used like a SAN; or
-
DRBD completely managed by Pacemaker through the DRBD OCF resource agent
Both approaches have advantages and disadvantages, which are discussed below.
It’s recommended to have either fencing configured or quorum enabled. (But not both. External fencing handler results may interact in conflicting ways with DRBD internal quorum.) If your cluster has communication issues (for example, a network switch loses power) and gets split, each part might start the services (failover), which causes a split brain when communication resumes.
9.1. Introduction to Pacemaker
Pacemaker is a sophisticated, feature-rich, and widely deployed cluster resource manager for the Linux platform. It includes a rich set of documentation. To understand this chapter, reading the following documents is highly recommended:
-
Clusters From Scratch, a step-by-step guide to configuring high-availability clusters;
-
CRM CLI (command line interface) tool, a manual for the CRM shell, a simple and intuitive command line interface bundled with Pacemaker;
-
Pacemaker Configuration Explained, a reference document explaining the concept and design behind Pacemaker.
9.2. Using DRBD as a Background Service in a Pacemaker Cluster
In this section you will see that autonomous DRBD storage can look like local storage, so integrating it into a Pacemaker cluster is done by pointing your mount points at DRBD.
First of all, we will use the auto-promote feature of DRBD, so that DRBD automatically sets itself Primary when needed. This will probably apply to all of your resources, so setting that as a default “yes” in the common section makes sense:
common {
options {
auto-promote yes;
...
}
}
Now you just need to use your storage, for example, through a filesystem:
crm configure
crm(live)configure# primitive fs_mysql ocf:heartbeat:Filesystem \
  params device="/dev/drbd/by-res/mysql/0" \
  directory="/var/lib/mysql" fstype="ext3"
crm(live)configure# primitive ip_mysql ocf:heartbeat:IPaddr2 \
  params ip="10.9.42.1" nic="eth0"
crm(live)configure# primitive mysqld lsb:mysqld
crm(live)configure# group mysql fs_mysql ip_mysql mysqld
crm(live)configure# commit
crm(live)configure# exit
bye
Essentially all that is needed is a mount point (/var/lib/mysql in this example) where the DRBD resource gets mounted.
Provided that Pacemaker has control, it will only allow a single instance of that mount across your cluster.
See also Importing DRBD’s Promotion Scores into the CIB for additional information about ordering constraints for system startup and more.
9.3. Adding a DRBD-backed Service to the Cluster Configuration, Including a Master-Slave Resource
This section explains how to enable a DRBD-backed service in a Pacemaker cluster.
If you are employing the DRBD OCF resource agent, it is recommended that you defer DRBD startup, shutdown, promotion, and demotion exclusively to the OCF resource agent. That means that you should disable the DRBD init script:
chkconfig drbd off
The ocf:linbit:drbd OCF resource agent provides Master/Slave capability, allowing Pacemaker to start and monitor the DRBD resource on multiple nodes and to promote and demote it as needed. You must, however, understand that the DRBD RA disconnects and detaches all DRBD resources it manages on Pacemaker shutdown, and also upon enabling standby mode for a node.
The OCF resource agent which ships with DRBD belongs to the linbit provider, and therefore installs as /usr/lib/ocf/resource.d/linbit/drbd. There is a legacy resource agent that is included with the OCF resource agents package, which uses the heartbeat provider and installs into /usr/lib/ocf/resource.d/heartbeat/drbd. The legacy OCF RA is deprecated and should no longer be used.
To enable a DRBD-backed configuration for a MySQL database in a Pacemaker CRM cluster with the drbd OCF resource agent, you must create both the necessary resources, and Pacemaker constraints to ensure your service only starts on a previously promoted DRBD resource. You may do so using the crm shell, as outlined in the following example, which uses a master-slave resource:
crm configure
crm(live)configure# primitive drbd_mysql ocf:linbit:drbd \
  params drbd_resource="mysql" \
  op monitor interval="29s" role="Master" \
  op monitor interval="31s" role="Slave"
crm(live)configure# ms ms_drbd_mysql drbd_mysql \
  meta master-max="1" master-node-max="1" \
  clone-max="2" clone-node-max="1" \
  notify="true"
crm(live)configure# primitive fs_mysql ocf:heartbeat:Filesystem \
  params device="/dev/drbd/by-res/mysql/0" \
  directory="/var/lib/mysql" fstype="ext3"
crm(live)configure# primitive ip_mysql ocf:heartbeat:IPaddr2 \
  params ip="10.9.42.1" nic="eth0"
crm(live)configure# primitive mysqld lsb:mysqld
crm(live)configure# group mysql fs_mysql ip_mysql mysqld
crm(live)configure# colocation mysql_on_drbd \
  inf: mysql ms_drbd_mysql:Master
crm(live)configure# order mysql_after_drbd \
  inf: ms_drbd_mysql:promote mysql:start
crm(live)configure# commit
crm(live)configure# exit
bye
After this, your configuration should be enabled. Pacemaker now selects a node on which it promotes the DRBD resource, and then starts the DRBD-backed resource group on that same node.
See also Importing DRBD’s Promotion Scores into the CIB for additional information about location constraints for placing the Master role.
9.4. Using Resource-level Fencing in Pacemaker Clusters
This section outlines the steps necessary to prevent Pacemaker from promoting a DRBD Master/Slave resource when its DRBD replication link has been interrupted. This keeps Pacemaker from starting a service with outdated data and causing an unwanted “time warp” in the process.
To enable any resource-level fencing for DRBD, you must add the following lines to your resource configuration:
resource <resource> {
net {
fencing resource-only;
...
}
}
You will also have to make changes to the handlers section depending on the cluster infrastructure being used.
Corosync-based Pacemaker clusters can use the functionality explained in Resource-level Fencing Using the Cluster Information Base (CIB).
It is absolutely vital to configure at least two independent cluster communications channels for this functionality to work correctly. Corosync clusters should list at least two redundant rings in corosync.conf, or several paths when using knet.
9.4.1. Resource-level Fencing Using the Cluster Information Base (CIB)
To enable resource-level fencing for Pacemaker, you will have to set two options in drbd.conf:
resource <resource> {
net {
fencing resource-only;
...
}
handlers {
fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
# Note: we used to abuse the after-resync-target handler to do the
# unfence, but since 2016 have a dedicated unfence-peer handler.
# Using the after-resync-target handler is wrong in some corner cases.
...
}
...
}
With this configuration, if the DRBD replication link becomes disconnected, the crm-fence-peer.9.sh script contacts the cluster manager, determines the Pacemaker Master/Slave resource associated with this DRBD resource, and ensures that the Master/Slave resource no longer gets promoted on any node other than the currently active one. Conversely, when the connection is re-established and DRBD completes its synchronization process, then that constraint is removed and the cluster manager is free to promote the resource on any node again.
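Because resource-level fencing depends on the fencing policy and both handlers being present, a quick text check of a resource file can catch an incomplete configuration early. The following is a minimal sketch under stated assumptions: the function name check_fencing_config is hypothetical, and the check is a plain line match rather than a real parser of the DRBD configuration syntax, so it will not see settings inherited from a common section or split across lines.

```shell
# Hypothetical helper: verify that a DRBD resource file contains the fencing
# policy and both handlers described in this section.
check_fencing_config() {
    file=$1
    status=0
    for pattern in \
        'fencing[[:space:]]+resource-only;' \
        'fence-peer[[:space:]]+"[^"]+";' \
        'unfence-peer[[:space:]]+"[^"]+";'
    do
        # Anchor at the start of the line (after indentation), so that
        # commented-out settings are ignored and "fence-peer" does not
        # accidentally match an "unfence-peer" line.
        if ! grep -Eq "^[[:space:]]*$pattern" "$file"; then
            echo "missing setting matching: $pattern" >&2
            status=1
        fi
    done
    return $status
}
```

You might call it as check_fencing_config /etc/drbd.d/mysql.res, where the path is only an example. The function returns a nonzero exit status and names each missing setting on standard error.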
9.5. Using Stacked DRBD Resources in Pacemaker Clusters
Stacking is deprecated in DRBD version 9.x, as more nodes can be implemented on a single level. See Defining network connections for details.
Stacked resources allow DRBD to be used for multi-level redundancy in multiple-node clusters, or to establish off-site disaster recovery capability. This section describes how to configure DRBD and Pacemaker in such configurations.
9.5.1. Adding Off-site Disaster Recovery to Pacemaker Clusters
In this configuration scenario, we would deal with a two-node high availability cluster in one site, plus a separate node which would presumably be housed off-site. The third node acts as a disaster recovery node and is a standalone server. Consider the following illustration to describe the concept.
In this example, ‘alice’ and ‘bob’ form a two-node Pacemaker cluster, whereas ‘charlie’ is an off-site node not managed by Pacemaker.
To create such a configuration, you would first configure and initialize DRBD resources as described in Creating a Stacked Three-node Setup. Then, configure Pacemaker with the following CRM configuration:
primitive p_drbd_r0 ocf:linbit:drbd \
params drbd_resource="r0"
primitive p_drbd_r0-U ocf:linbit:drbd \
params drbd_resource="r0-U"
primitive p_ip_stacked ocf:heartbeat:IPaddr2 \
params ip="192.168.42.1" nic="eth0"
ms ms_drbd_r0 p_drbd_r0 \
meta master-max="1" master-node-max="1" \
clone-max="2" clone-node-max="1" \
notify="true" globally-unique="false"
ms ms_drbd_r0-U p_drbd_r0-U \
meta master-max="1" clone-max="1" \
clone-node-max="1" master-node-max="1" \
notify="true" globally-unique="false"
colocation c_drbd_r0-U_on_drbd_r0 \
inf: ms_drbd_r0-U ms_drbd_r0:Master
colocation c_drbd_r0-U_on_ip \
inf: ms_drbd_r0-U p_ip_stacked
colocation c_ip_on_r0_master \
inf: p_ip_stacked ms_drbd_r0:Master
order o_ip_before_r0-U \
inf: p_ip_stacked ms_drbd_r0-U:start
order o_drbd_r0_before_r0-U \
inf: ms_drbd_r0:promote ms_drbd_r0-U:start
Assuming you created this configuration in a temporary file named /tmp/crm.txt, you may import it into the live cluster configuration with the following command:
crm configure < /tmp/crm.txt
This configuration will ensure that the following actions occur in the correct order on the ‘alice’/’bob’ cluster:
-
Pacemaker starts the DRBD resource r0 on both cluster nodes, and promotes one node to the Master (DRBD Primary) role.
-
Pacemaker then starts the IP address 192.168.42.1, which the stacked resource is to use for replication to the third node. It does so on the node it has previously promoted to the Master role for the r0 DRBD resource.
-
On the node which now has the Primary role for r0 and also the replication IP address for r0-U, Pacemaker now starts the r0-U DRBD resource, which connects and replicates to the off-site node.
-
Pacemaker then promotes the r0-U resource to the Primary role too, so it can be used by an application.
Therefore, this Pacemaker configuration ensures that there is not only full data redundancy between cluster nodes, but also to the third, off-site node.
This type of setup is usually deployed together with DRBD Proxy.
9.5.2. Using Stacked Resources to Achieve Four-way Redundancy in Pacemaker Clusters
In this configuration, a total of three DRBD resources (two unstacked, one stacked) are used to achieve 4-way storage redundancy. This means that of a four-node cluster, up to three nodes can fail while still providing service availability.
Consider the following illustration to explain the concept.
In this example, ‘alice’, ‘bob’, ‘charlie’, and ‘daisy’ form two two-node Pacemaker clusters. ‘alice’ and ‘bob’ form the cluster named left and replicate data using a DRBD resource between them, while ‘charlie’ and ‘daisy’ do the same with a separate DRBD resource, in a cluster named right. A third, stacked DRBD resource connects the two clusters.
Due to limitations in the Pacemaker cluster manager as of Pacemaker version 1.0.5, it is not possible to create this setup in a single four-node cluster without disabling CIB validation, which is an advanced process not recommended for general-purpose use. It is anticipated that this will be addressed in future Pacemaker releases.
To create such a configuration, you would first configure and initialize DRBD resources as described in Creating a Stacked Three-node Setup (except that the remote half of the DRBD configuration is also stacked, not just the local cluster). Then, configure Pacemaker with the following CRM configuration, starting with the cluster left:
primitive p_drbd_left ocf:linbit:drbd \
params drbd_resource="left"
primitive p_drbd_stacked ocf:linbit:drbd \
params drbd_resource="stacked"
primitive p_ip_stacked_left ocf:heartbeat:IPaddr2 \
params ip="10.9.9.100" nic="eth0"
ms ms_drbd_left p_drbd_left \
meta master-max="1" master-node-max="1" \
clone-max="2" clone-node-max="1" \
notify="true"
ms ms_drbd_stacked p_drbd_stacked \
meta master-max="1" clone-max="1" \
clone-node-max="1" master-node-max="1" \
notify="true" target-role="Master"
colocation c_ip_on_left_master \
inf: p_ip_stacked_left ms_drbd_left:Master
colocation c_drbd_stacked_on_ip_left \
inf: ms_drbd_stacked p_ip_stacked_left
order o_ip_before_stacked_left \
inf: p_ip_stacked_left ms_drbd_stacked:start
order o_drbd_left_before_stacked_left \
inf: ms_drbd_left:promote ms_drbd_stacked:start
Assuming you created this configuration in a temporary file named /tmp/crm.txt, you may import it into the live cluster configuration with the following command:
crm configure < /tmp/crm.txt
After adding this configuration to the CIB, Pacemaker will execute the following actions:
-
Bring up the DRBD resource left replicating between ‘alice’ and ‘bob’, promoting the resource to the Master role on one of these nodes.
-
Bring up the IP address 10.9.9.100 (on either ‘alice’ or ‘bob’, depending on which of these holds the Master role for the resource left).
-
Bring up the DRBD resource stacked on the same node that holds the just-configured IP address.
-
Promote the stacked DRBD resource to the Primary role.
Now, proceed on the cluster right by creating the following configuration:
primitive p_drbd_right ocf:linbit:drbd \
params drbd_resource="right"
primitive p_drbd_stacked ocf:linbit:drbd \
params drbd_resource="stacked"
primitive p_ip_stacked_right ocf:heartbeat:IPaddr2 \
params ip="10.9.10.101" nic="eth0"
ms ms_drbd_right p_drbd_right \
meta master-max="1" master-node-max="1" \
clone-max="2" clone-node-max="1" \
notify="true"
ms ms_drbd_stacked p_drbd_stacked \
meta master-max="1" clone-max="1" \
clone-node-max="1" master-node-max="1" \
notify="true" target-role="Slave"
colocation c_drbd_stacked_on_ip_right \
inf: ms_drbd_stacked p_ip_stacked_right
colocation c_ip_on_right_master \
inf: p_ip_stacked_right ms_drbd_right:Master
order o_ip_before_stacked_right \
inf: p_ip_stacked_right ms_drbd_stacked:start
order o_drbd_right_before_stacked_right \
inf: ms_drbd_right:promote ms_drbd_stacked:start
After adding this configuration to the CIB, Pacemaker will execute the following actions:
-
Bring up the DRBD resource right replicating between ‘charlie’ and ‘daisy’, promoting the resource to the Master role on one of these nodes.
-
Bring up the IP address 10.9.10.101 (on either ‘charlie’ or ‘daisy’, depending on which of these holds the Master role for the resource right).
-
Bring up the DRBD resource stacked on the same node that holds the just-configured IP address.
-
Leave the stacked DRBD resource in the Secondary role (due to target-role="Slave").
9.6. Configuring DRBD to Replicate Between Two SAN-backed Pacemaker Clusters
This is a somewhat advanced setup usually employed in split-site configurations. It involves two separate Pacemaker clusters, where each cluster has access to a separate Storage Area Network (SAN). DRBD is then used to replicate data stored on that SAN, across an IP link between sites.
Consider the following illustration to describe the concept.
Which of the individual nodes in each site currently acts as the DRBD peer is not explicitly defined — the DRBD peers are said to float; that is, DRBD binds to virtual IP addresses not tied to a specific physical machine.
This type of setup is usually deployed together with DRBD Proxy or truck based replication, or both.
Since this type of setup deals with shared storage, configuring and testing STONITH is absolutely vital for it to work properly.
9.6.1. DRBD Resource Configuration
To enable your DRBD resource to float, configure it in drbd.conf in the following fashion:
resource <resource> {
...
device /dev/drbd0;
disk /dev/sda1;
meta-disk internal;
floating 10.9.9.100:7788;
floating 10.9.10.101:7788;
}
The floating keyword replaces the on <host> sections normally found in the resource configuration. In this mode, DRBD identifies peers by IP address and TCP port, rather than by host name. It is important to note that the addresses specified must be virtual cluster IP addresses, rather than physical node IP addresses, for floating to function properly. As shown in the example, in split-site configurations the two floating addresses can be expected to belong to two separate IP networks. It is therefore vital for routers and firewalls to properly allow DRBD replication traffic between the nodes.
9.6.2. Pacemaker Resource Configuration
A DRBD floating peers setup, in terms of Pacemaker configuration, involves the following items (in each of the two Pacemaker clusters involved):
-
A virtual cluster IP address.
-
A master/slave DRBD resource (using the DRBD OCF resource agent).
-
Pacemaker constraints ensuring that resources are started on the correct nodes, and in the correct order.
To configure a resource named mysql in a floating peers configuration in a 2-node cluster, using the replication address 10.9.9.100, configure Pacemaker with the following crm commands:
crm configure
crm(live)configure# primitive p_ip_float_left ocf:heartbeat:IPaddr2 \
  params ip=10.9.9.100
crm(live)configure# primitive p_drbd_mysql ocf:linbit:drbd \
  params drbd_resource=mysql
crm(live)configure# ms ms_drbd_mysql p_drbd_mysql \
  meta master-max="1" master-node-max="1" \
  clone-max="1" clone-node-max="1" \
  notify="true" target-role="Master"
crm(live)configure# order drbd_after_left \
  inf: p_ip_float_left ms_drbd_mysql
crm(live)configure# colocation drbd_on_left \
  inf: ms_drbd_mysql p_ip_float_left
crm(live)configure# commit
crm(live)configure# exit
bye
After adding this configuration to the CIB, Pacemaker will execute the following actions:
-
Bring up the IP address 10.9.9.100 (on either ‘alice’ or ‘bob’).
-
Bring up the DRBD resource according to the IP address configured.
-
Promote the DRBD resource to the Primary role.
Then, to create the matching configuration in the other cluster, configure that Pacemaker instance with the following commands:
crm configure
crm(live)configure# primitive p_ip_float_right ocf:heartbeat:IPaddr2 \
  params ip=10.9.10.101
crm(live)configure# primitive drbd_mysql ocf:linbit:drbd \
  params drbd_resource=mysql
crm(live)configure# ms ms_drbd_mysql drbd_mysql \
  meta master-max="1" master-node-max="1" \
  clone-max="1" clone-node-max="1" \
  notify="true" target-role="Slave"
crm(live)configure# order drbd_after_right \
  inf: p_ip_float_right ms_drbd_mysql
crm(live)configure# colocation drbd_on_right \
  inf: ms_drbd_mysql p_ip_float_right
crm(live)configure# commit
crm(live)configure# exit
bye
After adding this configuration to the CIB, Pacemaker will execute the following actions:
-
Bring up the IP address 10.9.10.101 (on either ‘charlie’ or ‘daisy’).
-
Bring up the DRBD resource according to the IP address configured.
-
Leave the DRBD resource in the Secondary role (due to target-role="Slave").
9.6.3. Site Failover
In split-site configurations, it may be necessary to transfer services from one site to another. This may be a consequence of a scheduled migration, or of a disastrous event. In case the migration is a normal, anticipated event, the recommended course of action is this:
-
Connect to the cluster on the site about to relinquish resources, and change the affected DRBD resource’s target-role attribute from Master to Slave. This shuts down any resources that depend on the Primary role of the DRBD resource and demotes it. The DRBD resource itself continues to run, ready to receive updates from a new Primary.
-
Connect to the cluster on the site about to take over resources, and change the affected DRBD resource’s target-role attribute from Slave to Master. This will promote the DRBD resources, start any other Pacemaker resources depending on the Primary role of the DRBD resource, and replicate updates to the remote site.
-
To fail back, simply reverse the procedure.
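The planned migration steps above can be written down as a small dry-run helper that only prints the commands to enter on each site. This is a sketch under assumptions: the function name plan_site_failover is hypothetical, and it assumes that you change the target-role meta attribute with the CRM shell's resource meta subcommand and that the master/slave resource has the same name (for example, ms_drbd_mysql) in both clusters. Nothing is executed against a cluster.

```shell
# Hypothetical dry-run helper: print the CRM shell commands for a planned
# site failover of a master/slave DRBD resource. Nothing is executed.
plan_site_failover() {
    ms_resource=$1
    echo "# On the site relinquishing resources:"
    echo "crm resource meta $ms_resource set target-role Slave"
    echo "# On the site taking over resources:"
    echo "crm resource meta $ms_resource set target-role Master"
}

plan_site_failover ms_drbd_mysql
```

Printing the commands rather than running them keeps the order of operations explicit: the relinquishing site is demoted first, and only then is the other site promoted.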
In case of a catastrophic outage on the active site, it can be expected that the site is offline and no longer replicated to the backup site. In such an event:
-
Connect to the cluster on the still-functioning site, and change the affected DRBD resource’s target-role attribute from Slave to Master. This will promote the DRBD resources, and start any other Pacemaker resources depending on the Primary role of the DRBD resource.
-
When the original site is restored or rebuilt, you may connect the DRBD resources again, and subsequently fail back using the reverse procedure.
9.7. Importing DRBD’s Promotion Scores into the CIB
Everything described in this section depends on the drbd-attr OCF resource agent. It has been available since drbd-utils version 9.15.0. On Debian/Ubuntu systems, it is part of the drbd-utils package. On RPM-based Linux distributions, you need to install the drbd-pacemaker package.
Every DRBD resource exposes a promotion score on each node where it is configured. It is a numeric value that might be 0 or positive. The value reflects how desirable it is to promote the resource to master on this particular node. A node that has an UpToDate disk and two UpToDate replicas has a higher score than a node with an UpToDate disk and just one UpToDate replica.
During startup, the promotion score is 0, for example, before the DRBD device has its backing device attached, or, if quorum is enabled, before quorum is gained. A value of 0 indicates that a promotion request will fail, and is mapped to a Pacemaker score that indicates that the resource must not run here.
The drbd-attr OCF resource agent imports these promotion scores into node attributes of a Pacemaker cluster. It needs to be configured like this:
primitive drbd-attr ocf:linbit:drbd-attr
clone drbd-attr-clone drbd-attr
These are transient attributes (they have a lifetime of reboot, in Pacemaker terms). That means that after a reboot of the node, or a local restart of Pacemaker, those attributes will not exist until an instance of drbd-attr is started on that node.
You can inspect the generated attributes with crm_mon -A -1.
These attributes can be used in constraints for services that depend on the DRBD devices, or, when managing DRBD with the ocf:linbit:drbd resource agent, for the Master role of that DRBD instance.
Here is an example location constraint for the example resource from Using DRBD as a Background Service in a Pacemaker Cluster:
location lo_fs_mysql fs_mysql \
rule -inf: not_defined drbd-promotion-score-mysql \
rule drbd-promotion-score-mysql: defined drbd-promotion-score-mysql
This means that if the attribute is not defined, the fs_mysql file system cannot be mounted on this node. When the attribute is defined, its value becomes the score of the location constraint.
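If several services depend on DRBD resources, the same pair of rules can be generated for each of them. The following is a minimal sketch: the function name promotion_score_location is hypothetical, and it assumes the attribute name drbd-promotion-score-<resource> shown in the example constraint, which would differ if you change the attr_name_prefix parameter of the resource agent.

```shell
# Hypothetical helper: print a location constraint that ties a Pacemaker
# primitive to the promotion score attribute of a DRBD resource.
# Usage: promotion_score_location <primitive> <drbd_resource>
promotion_score_location() {
    primitive=$1
    drbd_resource=$2
    # Attribute name as used in the example constraint (default prefix assumed).
    attr="drbd-promotion-score-$drbd_resource"
    printf 'location lo_%s %s \\\n' "$primitive" "$primitive"
    printf '    rule -inf: not_defined %s \\\n' "$attr"
    printf '    rule %s: defined %s\n' "$attr" "$attr"
}

promotion_score_location fs_mysql mysql
```

Called with fs_mysql and mysql, the function prints the same three-line constraint as the example, with the two rule lines indented. You can review the output and then load it with crm configure.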
This can also be used to cause Pacemaker to migrate a service away when DRBD loses a local backing device. Because a failed backing block device causes the promotion score to drop, other nodes with working backing devices will expose higher promotion scores.
The attributes are updated live, independent of the resource-agent’s monitor operation, with a dampening delay of 5 seconds by default.
The resource agent has these optional parameters (see also its man page, ocf_linbit_drbd-attr(7)):
-
dampening_delay
-
attr_name_prefix
-
record_event_details
10. Using LVM with DRBD
This chapter deals with managing DRBD for use with LVM2. In particular, this chapter covers how to:
-
Use LVM Logical Volumes as backing devices for DRBD.
-
Use DRBD devices as Physical Volumes for LVM.
-
Combine these two concepts to implement a layered LVM approach using DRBD.
If you are unfamiliar with these terms, the next section, Introduction to LVM, may serve as a starting point to learn about LVM concepts. However, you are also encouraged to familiarize yourself with LVM in more detail than this section provides.
10.1. Introduction to LVM
LVM2 is an implementation of logical volume management in the context of the Linux device mapper framework. It has practically nothing in common, other than the name and acronym, with the original LVM implementation. The old implementation (now retroactively named “LVM1”) is considered obsolete; it is not covered in this section.
When working with LVM, it is important to understand its most basic concepts:
A Physical Volume (PV) is an underlying block device exclusively managed by LVM. PVs can either be entire hard disks or individual partitions. It is common practice to create a partition table on the hard disk where one partition is dedicated to the use by the Linux LVM.
The partition type “Linux LVM” (signature 0x8E) can be used to identify partitions for exclusive use by LVM. This, however, is not required: LVM recognizes PVs by way of a signature written to the device upon PV initialization. |
A Volume Group (VG) is the basic administrative unit of LVM. A VG may include one or more PVs. Every VG has a unique name. A VG may be extended during runtime by adding additional PVs, or by enlarging an existing PV.
Logical Volumes (LVs) may be created during runtime within VGs and are available to the other parts of the kernel as regular block devices. As such, they may be used to hold a file system, or for any other purpose block devices may be used for. LVs may be resized while they are online, and they may also be moved from one PV to another (provided that the PVs are part of the same VG).
Snapshots are temporary point-in-time copies of LVs. Creating snapshots is an operation that completes almost instantly, even if the original LV (the origin volume) has a size of several hundred GiB. Usually, a snapshot requires significantly less space than the original LV.
10.2. Using a Logical Volume as a DRBD Backing Device
Because an existing LVM logical volume is simply a block device in Linux terms, you can use one as a DRBD backing device. To use an LVM logical volume in this way, you simply create one, and then reference it in a DRBD resource configuration just as you would a physical backing disk.
When using DRBD together with LVM, set global_filter = [ "r|^/dev/drbd|" ] in the LVM configuration. This setting tells LVM to reject DRBD devices from operations such as scanning or opening attempts. In some cases, not setting this filter might lead to increased CPU load or stuck LVM operations. |
The following example assumes that an LVM volume group named foo already exists on two nodes in your LVM-enabled cluster, and that you want to create a DRBD resource named r0 using a logical volume in that volume group.
First, you create the logical volume on both nodes in your cluster:
# lvcreate --name bar --size 10G foo
Logical volume "bar" created
After this, you will have a block device named /dev/foo/bar on both nodes.
Then, you can simply enter the newly created volumes in your resource configuration:
resource r0 {
  ...
  on alice {
    device /dev/drbd0;
    disk   /dev/foo/bar;
    ...
  }
  on bob {
    device /dev/drbd0;
    disk   /dev/foo/bar;
    ...
  }
}
Now you can continue to bring your resource up, just as you would if you were using non-LVM block devices.
10.3. Using Automated LVM Snapshots During DRBD Synchronization
While DRBD is synchronizing, the SyncTarget’s state is Inconsistent until the synchronization completes. If in this situation the SyncSource happens to fail (beyond repair), this puts you in an unfortunate position: the node with good data is dead, and the surviving node has bad (inconsistent) data.
When serving DRBD off an LVM Logical Volume, you can mitigate this problem by creating an automated snapshot when synchronization starts, and automatically removing that same snapshot once synchronization has completed successfully.
To enable automated snapshotting during resynchronization, add the following lines to your resource configuration:
resource r0 {
  handlers {
    before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
    after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
  }
}
The two scripts parse the $DRBD_RESOURCE environment variable, which DRBD automatically passes to any handler it invokes. The snapshot-resync-target-lvm.sh script creates an LVM snapshot for each volume the resource contains before synchronization kicks off. If the script fails, the synchronization does not commence.
Once synchronization completes, the unsnapshot-resync-target-lvm.sh script removes the snapshot, which is then no longer needed. If unsnapshotting fails, the snapshot continues to linger around.
You should review dangling snapshots as soon as possible. A full snapshot causes both the snapshot itself and its origin volume to fail. |
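One way to keep an eye on lingering snapshots is to filter the output of lvs for snapshots that are filling up. The following is a minimal sketch, not part of the DRBD scripts: the function name full_snapshots, the threshold, and the sample volume names are assumptions for illustration. It expects lines of the form "lv_name origin snap_percent" on standard input and relies on plain (non-snapshot) volumes having empty origin and snap_percent fields.

```shell
#!/bin/sh
# Sketch only: print snapshots whose fill level is at or above a threshold.
# $1: threshold in percent.
# Input (stdin): lines of "lv_name origin snap_percent", for example from
#   lvs --noheadings -o lv_name,origin,snap_percent
# Lines that do not have all three fields (plain volumes) are ignored.
full_snapshots() {
    awk -v limit="$1" 'NF == 3 && $3 + 0 >= limit + 0 { print $1 " (" $3 "%)" }'
}

# Hypothetical sample data standing in for real lvs output.
printf '%s\n' '  bar' '  bar-snap bar 85.20' '  other-snap baz 3.10' \
    | full_snapshots 80
```

The number format of snap_percent may depend on the LVM version and locale, so check the real lvs output on your system before relying on a filter like this.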
If at any time your SyncSource does fail beyond repair and you decide to revert to your latest snapshot on the peer, you may do so by issuing the lvconvert -M command.
10.4. Configuring a DRBD Resource as a Physical Volume
To prepare a DRBD resource for use as a Physical Volume, it is necessary to create a PV signature on the DRBD device. To do this, issue one of the following commands on the node where the resource is currently in the primary role:
# pvcreate /dev/drbdX
or
# pvcreate /dev/drbd/by-res/<resource>/0
This example assumes a single-volume resource. |
Now, it is necessary to include this device in the list of devices LVM scans for PV signatures. To do this, you must edit the LVM configuration file, normally named /etc/lvm/lvm.conf. Find the line in the devices section that contains the filter keyword and edit it accordingly. If all your PVs are to be stored on DRBD devices, the following is an appropriate filter option:
filter = [ "a|drbd.*|", "r|.*|" ]
This filter expression accepts PV signatures found on any DRBD devices, while rejecting (ignoring) all others.
By default, LVM scans all block devices found in /dev for PV signatures. This is equivalent to filter = [ "a|.*|" ]. |
If you want to use stacked resources as LVM PVs, then you will need a more explicit filter configuration. You need to verify that LVM detects PV signatures on stacked resources, while ignoring them on the corresponding lower-level resources and backing devices. This example assumes that your lower-level DRBD resources use device minors 0 through 9, whereas your stacked resources are using device minors from 10 upwards:
filter = [ "a|drbd1[0-9]|", "r|.*|" ]
This filter expression accepts PV signatures found only on the DRBD devices /dev/drbd10 through /dev/drbd19, while rejecting (ignoring) all others.
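To get a feel for which device names a filter pattern would accept, you can try the regular expression against a few paths before editing lvm.conf. The sketch below uses grep -E as a stand-in for LVM's own matching, which is an assumption: it is adequate for simple patterns like the one above, but it is not LVM's matcher. The function name filter_accepts and the list of device paths are made up for illustration.

```shell
#!/bin/sh
# Sketch only: report whether an unanchored extended regular expression
# matches a device path. grep -E stands in for LVM's filter matching.
# $1: pattern (the part between the delimiters of an "a|...|" entry)
# $2: device path
filter_accepts() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Hypothetical device paths to try the pattern against.
for dev in /dev/drbd0 /dev/drbd9 /dev/drbd10 /dev/drbd19 /dev/drbd100; do
    if filter_accepts 'drbd1[0-9]' "$dev"; then
        echo "$dev accepted"
    else
        echo "$dev rejected"
    fi
done
```

Because the pattern is not anchored, a path such as /dev/drbd100 also matches in this sketch. If your system uses device minors of 100 or above, anchoring the pattern (for example, drbd1[0-9]$) is one way to keep the match limited to /dev/drbd10 through /dev/drbd19.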
After modifying the lvm.conf file, you must run the vgscan command so LVM discards its configuration cache and re-scans devices for PV signatures.
You may of course use a different filter configuration to match your particular system configuration. What is important to remember, however, is that you need to:
-
Accept (include) the DRBD devices that you want to use as PVs.
-
Reject (exclude) the corresponding lower-level devices, so as to avoid LVM finding duplicate PV signatures.
In addition, you should disable the LVM cache by setting:
write_cache_state = 0
After disabling the LVM cache, remove any stale cache entries by deleting /etc/lvm/cache/.cache.
You must repeat the above steps on the peer nodes, too.
If your system has its root filesystem on LVM, Volume Groups will be activated from your initial RAM disk (initrd) during boot. In doing so, the LVM tools will evaluate an lvm.conf file included in the initrd image. Therefore, after you make any changes to your lvm.conf, you should be certain to update your initrd with the utility appropriate for your distribution (mkinitrd, update-initramfs, and so on). |
When you have configured your new PV, you may proceed to add it to a Volume Group, or create a new Volume Group from it. The DRBD resource must, of course, be in the primary role while doing so.
# vgcreate <name> /dev/drbdX
While it is possible to mix DRBD and non-DRBD Physical Volumes within the same Volume Group, doing so is not recommended and unlikely to be of any practical value. |
When you have created your VG, you may start carving Logical Volumes out of it, using the lvcreate command (as with a non-DRBD-backed Volume Group).
10.5. Adding a New DRBD Volume to an Existing Volume Group
Occasionally, you may want to add new DRBD-backed Physical Volumes to a Volume Group. Whenever you do so, a new volume should be added to an existing resource configuration. This preserves the replication stream and ensures write fidelity across all PVs in the VG.
If your LVM volume group is managed by Pacemaker as explained in Highly Available LVM with Pacemaker, it is imperative to place the cluster in maintenance mode prior to making changes to the DRBD configuration. |
Extend your resource configuration to include an additional volume, as in the following example:
resource r0 {
  volume 0 {
    device    /dev/drbd1;
    disk      /dev/sda7;
    meta-disk internal;
  }
  volume 1 {
    device    /dev/drbd2;
    disk      /dev/sda8;
    meta-disk internal;
  }
  on alice {
    address   10.1.1.31:7789;
  }
  on bob {
    address   10.1.1.32:7789;
  }
}
Verify that your DRBD configuration is identical across nodes, then issue:
# drbdadm adjust r0
This will implicitly call drbdsetup new-minor r0 1 to enable the new volume 1 in the resource r0. Once the new volume has been added to the replication stream, you may initialize and add it to the volume group:
# pvcreate /dev/drbd/by-res/<resource>/1
# vgextend <name> /dev/drbd/by-res/<resource>/1
This will add the new PV /dev/drbd/by-res/<resource>/1 to the <name> VG, preserving write fidelity across the entire VG.
10.6. Highly Available LVM with Pacemaker
The process of transferring volume groups between peers and making the corresponding logical volumes available can be automated. The Pacemaker LVM resource agent is designed for exactly that purpose.
To put an existing, DRBD-backed volume group under Pacemaker management, run the following commands in the crm shell:
primitive p_drbd_r0 ocf:linbit:drbd \
  params drbd_resource="r0" \
  op monitor interval="29s" role="Master" \
  op monitor interval="31s" role="Slave"
ms ms_drbd_r0 p_drbd_r0 \
  meta master-max="1" master-node-max="1" \
       clone-max="2" clone-node-max="1" \
       notify="true"
primitive p_lvm_r0 ocf:heartbeat:LVM \
  params volgrpname="r0"
colocation c_lvm_on_drbd inf: p_lvm_r0 ms_drbd_r0:Master
order o_drbd_before_lvm inf: ms_drbd_r0:promote p_lvm_r0:start
commit
After you have committed this configuration, Pacemaker will automatically make the r0 volume group available on whichever node currently has the Primary (Master) role for the DRBD resource.
10.7. Using DRBD and LVM Without a Cluster Resource Manager
The typical high availability use case for DRBD is to use a cluster resource manager (CRM) to handle the promoting and demoting of resources, such as DRBD replicated storage volumes. However, it is possible to use DRBD without a CRM.
You might want to do this in a situation when you know that you always want a particular node to promote a DRBD resource and you know that the peer nodes are never going to take over but are only being replicated to for disaster recovery purposes.
In this case, you can use a couple of systemd unit files to handle DRBD resource promotion and make sure that back-end LVM logical volumes are activated first. You also need to make the DRBD systemd unit file for your DRBD resource a dependency of whatever file system mount might be using the DRBD resource as a backing device.
To set this up, for example, given a hypothetical DRBD resource named webdata and a file system mount point of /var/lib/www, you might enter the following commands:
# systemctl enable drbd-wait-promotable@webdata.service
# systemctl enable drbd-promote@webdata.service
# echo "/dev/drbdX /var/lib/www xfs defaults,nofail,x-systemd.requires=drbd-promote@webdata.service 0 0" >> /etc/fstab
In this example, the X in drbdX is the device minor number of the DRBD device for the webdata resource.
The drbd-wait-promotable@<DRBD-resource-name>.service is a systemd unit file that is used to wait for DRBD to connect to its peers and establish access to good data, before DRBD promotes the resource on the node.
11. Using GFS with DRBD
This chapter outlines the steps necessary to set up a DRBD resource as a block device holding a shared Global File System (GFS). It covers both GFS and GFS2.
To use GFS on top of DRBD, you must configure DRBD in dual-primary mode.
DRBD 9 supports exactly two nodes with its dual-primary mode. Attempting to use more than two nodes in the Primary state is not supported and may lead to data loss. |
All cluster file systems require fencing, not only through the DRBD resource, but STONITH! A faulty member must be killed. You will want these settings:
net {
  fencing resource-and-stonith;
}
handlers {
  # Make sure the other node is confirmed
  # dead after this!
  fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
  after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
}
|
If a node becomes a disconnected primary, the resource-and-stonith network fencing setting will:
-
Freeze all the node’s I/O operations.
-
Call the node’s fence-peer handler.
If the fence-peer handler cannot reach the peer node, for example over an alternate network, then the fence-peer handler should STONITH the disconnected primary node. I/O operations will resume as soon as the situation is resolved.
11.1. Introduction to GFS
The Red Hat Global File System (GFS) is Red Hat’s implementation of a concurrent-access shared storage file system. Like any such filesystem, GFS allows multiple nodes to access the same storage device, in read/write fashion, simultaneously without risking data corruption. It does so by using a Distributed Lock Manager (DLM) which manages concurrent access from cluster members.
GFS was designed, from the outset, for use with conventional shared storage devices. Regardless, it is perfectly possible to use DRBD, in dual-primary mode, as a replicated storage device for GFS. Applications may benefit from reduced read/write latency due to the fact that DRBD normally reads from and writes to local storage, as opposed to the SAN devices GFS is normally configured to run from. Also, of course, DRBD adds an additional physical copy to every GFS filesystem, therefore adding redundancy to the concept.
GFS makes use of a cluster-aware variant of LVM, termed Cluster Logical Volume Manager or CLVM. As such, some parallelism exists between using DRBD as the data storage for GFS, and using DRBD as a Physical Volume for conventional LVM.
GFS file systems are usually tightly integrated with Red Hat’s own cluster management framework, the Red Hat Cluster. This chapter explains the use of DRBD in conjunction with GFS in the Red Hat Cluster context.
GFS, CLVM, and Red Hat Cluster are available in Red Hat Enterprise Linux (RHEL) and distributions derived from it, such as CentOS. Packages built from the same sources are also available in Debian GNU/Linux. This chapter assumes running GFS on a Red Hat Enterprise Linux system.
11.2. Creating a DRBD Resource Suitable for GFS
Since GFS is a shared cluster file system expecting concurrent read/write storage access from all cluster nodes, any DRBD resource to be used for storing a GFS filesystem must be configured in dual-primary mode. Also, it is recommended to use some of DRBD’s features for automatic recovery from split brain. To do all this, include the following lines in the resource configuration:
resource <resource> {
net {
allow-two-primaries yes;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
[...]
}
[...]
}
By configuring auto-recovery policies, with the exception of the |
Once you have added these options to your freshly-configured resource, you may initialize your resource as you normally would. Since the allow-two-primaries option is set to yes for this resource, you will be able to promote the resource to the primary role on two nodes.
11.3. Configuring LVM to Recognize the DRBD Resource
GFS uses CLVM, the cluster-aware version of LVM, to manage block devices to be used by GFS. To use CLVM with DRBD, ensure that your LVM configuration
-
uses clustered locking. To do this, set the following option in /etc/lvm/lvm.conf:
locking_type = 3
-
scans your DRBD devices to recognize DRBD-based Physical Volumes (PVs). This applies as it does to conventional (non-clustered) LVM; see Configuring a DRBD Resource as a Physical Volume for details.
11.4. Configuring Your Cluster to Support GFS
After you have created your new DRBD resource and completed your initial cluster configuration, you must enable and start the following system services on both nodes of your GFS cluster:
-
cman (this also starts ccsd and fenced),
-
clvmd.
11.5. Creating a GFS Filesystem
To create a GFS filesystem on your dual-primary DRBD resource, you must first initialize it as a Logical Volume for LVM.
Contrary to conventional, non-cluster-aware LVM configurations, the following steps must be completed on only one node due to the cluster-aware nature of CLVM:
# pvcreate /dev/drbd/by-res/<resource>/0
Physical volume "/dev/drbd<num>" successfully created
# vgcreate <vg-name> /dev/drbd/by-res/<resource>/0
Volume group "<vg-name>" successfully created
# lvcreate --size <size> --name <lv-name> <vg-name>
Logical volume "<lv-name>" created
This example assumes a single-volume resource. |
CLVM will immediately notify the peer node of these changes; issuing lvs (or lvdisplay) on the peer node will list the newly created logical volume.
Now, you may proceed by creating the actual filesystem:
# mkfs -t gfs -p lock_dlm -j 2 /dev/<vg-name>/<lv-name>
Or, for a GFS2 filesystem:
# mkfs -t gfs2 -p lock_dlm -j 2 -t <cluster>:<name> /dev/<vg-name>/<lv-name>
The -j option in this command refers to the number of journals to keep for GFS. This must be identical to the number of nodes with concurrent Primary role in the GFS cluster; since DRBD does not support more than two Primary nodes, the value to set here is always 2.
The -t option, applicable only for GFS2 filesystems, defines the lock table name. This follows the format <cluster>:<name>, where <cluster> must match your cluster name as defined in /etc/cluster/cluster.conf. Therefore, only members of that cluster will be permitted to use the filesystem. By contrast, <name> is an arbitrary file system name unique in the cluster.
11.6. Using Your GFS Filesystem
After you have created your filesystem, you may add it to /etc/fstab:
/dev/<vg-name>/<lv-name> <mountpoint> gfs defaults 0 0
For a GFS2 filesystem, simply change the filesystem type:
/dev/<vg-name>/<lv-name> <mountpoint> gfs2 defaults 0 0
Do not forget to make this change on both cluster nodes.
After this, you may mount your new filesystem by starting the gfs service (on both nodes):
# service gfs start
From then on, if you have DRBD configured to start automatically on system startup, before the Pacemaker services and the gfs service, you will be able to use this GFS file system as you would use one that is configured on traditional shared storage.
12. Using OCFS2 with DRBD
This chapter outlines the steps necessary to set up a DRBD resource as a block device holding a shared Oracle Cluster File System, version 2 (OCFS2).
All cluster file systems require fencing, not only through the DRBD resource, but STONITH! A faulty member must be killed. You’ll want these settings:
net {
  fencing resource-and-stonith;
}
handlers {
  # Make sure the other node is confirmed
  # dead after this!
  outdate-peer "/sbin/kill-other-node.sh";
}
There must be no volatile caches!
|
12.1. Introduction to OCFS2
The Oracle Cluster File System, version 2 (OCFS2) is a concurrent access shared storage file system developed by Oracle Corporation. Unlike its predecessor OCFS, which was specifically designed and only suitable for Oracle database payloads, OCFS2 is a general-purpose filesystem that implements most POSIX semantics. The most common use case for OCFS2 is arguably Oracle Real Application Cluster (RAC), but OCFS2 may also be used for load-balanced NFS clusters, for example.
Although originally designed for use with conventional shared storage devices, OCFS2 is equally well suited to be deployed on dual-Primary DRBD. Applications reading from the filesystem may benefit from reduced read latency due to the fact that DRBD reads from and writes to local storage, as opposed to the SAN devices OCFS2 otherwise normally runs on. In addition, DRBD adds redundancy to OCFS2 by adding an additional copy to every filesystem image, as opposed to just a single filesystem image that is merely shared.
Like other shared cluster file systems such as GFS, OCFS2 allows multiple nodes to access the same storage device, in read/write mode, simultaneously without risking data corruption. It does so by using a Distributed Lock Manager (DLM) which manages concurrent access from cluster nodes. The DLM itself uses a virtual file system (ocfs2_dlmfs) which is separate from the actual OCFS2 file systems present on the system.
OCFS2 may either use an intrinsic cluster communication layer to manage cluster membership and filesystem mount and unmount operations, or alternatively defer those tasks to the Pacemaker cluster infrastructure.
OCFS2 is available in SUSE Linux Enterprise Server (where it is the primarily supported shared cluster file system), CentOS, Debian GNU/Linux, and Ubuntu Server Edition. Oracle also provides packages for Red Hat Enterprise Linux (RHEL). This chapter assumes running OCFS2 on a SUSE Linux Enterprise Server system.
12.2. Creating a DRBD Resource Suitable for OCFS2
Since OCFS2 is a shared cluster file system expecting concurrent read/write storage access from all cluster nodes, any DRBD resource to be used for storing an OCFS2 filesystem must be configured in dual-primary mode. Also, it is recommended to use some of DRBD’s features for automatic recovery from split brain. To do all this, include the following lines in the resource configuration:
resource <resource> {
net {
# allow-two-primaries yes;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
...
}
...
}
By setting auto-recovery policies, you are effectively configuring automatic data-loss! Be sure you understand the implications. |
It is not recommended to set the allow-two-primaries option to yes upon initial configuration. You should do so after the initial resource synchronization has completed.
Once you have added these options to your freshly-configured resource, you may initialize your resource as you normally would. After you set the allow-two-primaries option to yes for this resource, you will be able to promote the resource to the primary role on both nodes.
12.3. Creating an OCFS2 Filesystem
Now, use OCFS2’s mkfs implementation to create the file system:
# mkfs -t ocfs2 -N 2 -L ocfs2_drbd0 /dev/drbd0
mkfs.ocfs2 1.4.0
Filesystem label=ocfs2_drbd0
Block size=1024 (bits=10)
Cluster size=4096 (bits=12)
Volume size=205586432 (50192 clusters) (200768 blocks)
7 cluster groups (tail covers 4112 clusters, rest cover 7680 clusters)
Journal size=4194304
Initial number of node slots: 2
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 0 block(s)
Formatting Journals: done
Writing lost+found: done
mkfs.ocfs2 successful
This will create an OCFS2 file system with two node slots on /dev/drbd0, and set the filesystem label to ocfs2_drbd0. You may specify other options on mkfs invocation; please see the mkfs.ocfs2 system manual page for details.
12.4. Pacemaker OCFS2 Management
12.4.1. Adding a Dual-Primary DRBD Resource to Pacemaker
An existing Dual-Primary DRBD resource may be added to Pacemaker resource management with the following crm configuration:
primitive p_drbd_ocfs2 ocf:linbit:drbd \
params drbd_resource="ocfs2"
ms ms_drbd_ocfs2 p_drbd_ocfs2 \
meta master-max=2 clone-max=2 notify=true
Note the master-max=2 meta variable; it enables dual-Master mode for a Pacemaker master/slave set. This requires that allow-two-primaries is also set to yes in the DRBD configuration. Otherwise, Pacemaker will flag a configuration error during resource validation. |
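Before setting master-max=2, it can be useful to confirm that the DRBD configuration really enables dual-primary mode, for instance because the allow-two-primaries line was left commented out during initial synchronization. The following is a minimal sketch under stated assumptions: the function name allows_two_primaries is made up, and it only looks for an uncommented "allow-two-primaries yes;" line in configuration text read from standard input (for example, the output of drbdadm dump for the resource). It does not parse the configuration or tell sections apart.

```shell
#!/bin/sh
# Sketch only: succeed if the configuration text on stdin contains an
# uncommented "allow-two-primaries yes;" line. Lines that start with "#"
# do not match because the pattern is anchored to the start of the line.
allows_two_primaries() {
    grep -Eq '^[[:space:]]*allow-two-primaries[[:space:]]+yes[[:space:]]*;'
}

# Hypothetical sample configuration with the option still commented out.
sample='resource ocfs2 {
    net {
        # allow-two-primaries yes;
        after-sb-0pri discard-zero-changes;
    }
}'

if printf '%s\n' "$sample" | allows_two_primaries; then
    echo "dual-primary enabled"
else
    echo "dual-primary not enabled"
fi
```

On a cluster node, the same check could be fed from the live configuration, for example with drbdadm dump ocfs2 | allows_two_primaries, assuming the dump prints the option in the form shown above.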
12.4.2. Adding OCFS2 Management Capability to Pacemaker
To manage OCFS2 and the kernel Distributed Lock Manager (DLM), Pacemaker uses a total of three different resource agents:
-
ocf:pacemaker:controld, Pacemaker’s interface to the DLM;
-
ocf:ocfs2:o2cb, Pacemaker’s interface to OCFS2 cluster management;
-
ocf:heartbeat:Filesystem, the generic filesystem management resource agent which supports cluster file systems when configured as a Pacemaker clone.
You may enable all nodes in a Pacemaker cluster for OCFS2 management by creating a cloned group of resources, with the following crm configuration:
primitive p_controld ocf:pacemaker:controld
primitive p_o2cb ocf:ocfs2:o2cb
group g_ocfs2mgmt p_controld p_o2cb
clone cl_ocfs2mgmt g_ocfs2mgmt meta interleave=true
Once this configuration is committed, Pacemaker will start instances of the controld and o2cb resource types on all nodes in the cluster.
12.4.3. Adding an OCFS2 Filesystem to Pacemaker
Pacemaker manages OCFS2 filesystems using the conventional ocf:heartbeat:Filesystem resource agent, albeit in clone mode. To put an OCFS2 filesystem under Pacemaker management, use the following crm configuration:
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/ocfs2/0" directory="/srv/ocfs2" \
fstype="ocfs2" options="rw,noatime"
clone cl_fs_ocfs2 p_fs_ocfs2
This example assumes a single-volume resource. |
12.4.4. Adding Required Pacemaker Constraints to Manage OCFS2 Filesystems
To tie all OCFS2-related resources and clones together, add the following constraints to your Pacemaker configuration:
order o_ocfs2 ms_drbd_ocfs2:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
colocation c_ocfs2 cl_fs_ocfs2 cl_ocfs2mgmt ms_drbd_ocfs2:Master
12.5. Legacy OCFS2 Management (Without Pacemaker)
The information presented in this section applies to legacy systems where OCFS2 DLM support is not available in Pacemaker. It is preserved here for reference purposes only. New installations should always use the Pacemaker approach. |
12.5.1. Configuring Your Cluster to Support OCFS2
Creating the Configuration File
OCFS2 uses a central configuration file, /etc/ocfs2/cluster.conf.
When creating your OCFS2 cluster, be sure to add both your hosts to the cluster configuration. The default port (7777) is usually an acceptable choice for cluster interconnect communications. If you choose any other port number, be sure to choose one that does not clash with an existing port used by DRBD (or any other configured TCP/IP service).
If you feel less than comfortable editing the cluster.conf file directly, you may also use the ocfs2console graphical configuration utility, which is usually more convenient. Regardless of the approach you selected, your /etc/ocfs2/cluster.conf file contents should look roughly like this:
node:
	ip_port = 7777
	ip_address = 10.1.1.31
	number = 0
	name = alice
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 10.1.1.32
	number = 1
	name = bob
	cluster = ocfs2

cluster:
	node_count = 2
	name = ocfs2
When you have configured your cluster, use scp to distribute the configuration to both nodes in the cluster.
Configuring the O2CB Driver in SUSE Linux Enterprise Systems
On SLES, you may use the configure option of the o2cb init script:
# /etc/init.d/o2cb configure
Configuring the O2CB driver.
This will configure the on-boot properties of the O2CB driver. The following questions will determine whether the driver is loaded on boot. The current values will be shown in brackets ('[]'). Hitting <ENTER> without typing an answer will keep that current value. Ctrl-C will abort.
Load O2CB driver on boot (y/n) [y]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [31]:
Specify network idle timeout in ms (>=5000) [30000]:
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Use user-space driven heartbeat? (y/n) [n]:
Writing O2CB configuration: OK
Loading module "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK
Configuring the O2CB Driver in Debian GNU/Linux Systems
On Debian, the configure option to /etc/init.d/o2cb is not available. Instead, reconfigure the ocfs2-tools package to enable the driver:
# dpkg-reconfigure -p medium -f readline ocfs2-tools
Configuring ocfs2-tools
Would you like to start an OCFS2 cluster (O2CB) at boot time? yes
Name of the cluster to start at boot time: ocfs2
The O2CB heartbeat threshold sets up the maximum time in seconds that a node awaits for an I/O operation. After it, the node "fences" itself, and you will probably see a crash.
It is calculated as the result of: (threshold - 1) x 2.
Its default value is 31 (60 seconds).
Raise it if you have slow disks and/or crashes with kernel messages like:
o2hb_write_timeout: 164 ERROR: heartbeat write timeout to device XXXX after NNNN milliseconds
O2CB Heartbeat threshold: `31`
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Starting O2CB cluster ocfs2: OK
12.5.2. Using Your OCFS2 Filesystem
When you have completed cluster configuration and created your file system, you may mount it as any other file system:
# mount -t ocfs2 /dev/drbd0 /shared
Your kernel log (accessible by issuing the command dmesg) should then contain a line similar to this one:
ocfs2: Mounting device (147,0) on (node 0, slot 0) with ordered data mode.
From that point forward, you should be able to simultaneously mount your OCFS2 filesystem on both your nodes, in read/write mode.
13. Using Xen with DRBD
This chapter outlines the use of DRBD as a Virtual Block Device (VBD) for virtualization environments using the Xen hypervisor.
13.1. Introduction to Xen
Xen is a virtualization framework originally developed at the University of Cambridge (UK), and later maintained by XenSource, Inc. (now a part of Citrix). It is included in reasonably recent releases of most Linux distributions, such as Debian GNU/Linux (since version 4.0), SUSE Linux Enterprise Server (since release 10), Red Hat Enterprise Linux (since release 5), and many others.
Xen uses paravirtualization, a virtualization method involving a high degree of cooperation between the virtualization host and guest virtual machines, with selected guest operating systems for improved performance in comparison to conventional virtualization solutions (which are typically based on hardware emulation). Xen also supports full hardware emulation on CPUs that support the appropriate virtualization extensions; in Xen parlance, this is known as HVM (“hardware-assisted virtual machine”).
At the time of writing, CPU extensions supported by Xen for HVM are Intel’s Virtualization Technology (VT, formerly codenamed “Vanderpool”), and AMD’s Secure Virtual Machine (SVM, formerly known as “Pacifica”). |
Xen supports live migration, which refers to the capability of transferring a running guest operating system from one physical host to another, without interruption.
When a DRBD resource is used as a replicated Virtual Block Device (VBD) for Xen, it serves to make the entire contents of a DomU’s virtual disk available on two servers, which can then be configured for automatic failover. That way, DRBD provides redundancy not only for Linux servers (as in non-virtual DRBD deployment scenarios), but also for any other operating system that can run virtually under Xen, which in essence includes any operating system available on 32- or 64-bit Intel compatible architectures.
13.2. Setting DRBD Module Parameters for Use with Xen
For Xen Domain-0 kernels, it is recommended to load the DRBD module with the parameter disable_sendpage set to 1. To do so, create (or open) the file /etc/modprobe.d/drbd.conf and enter the following line:
options drbd disable_sendpage=1
13.3. Creating a DRBD Resource Suitable to Act as a Xen Virtual Block Device
Configuring a DRBD resource that is to be used as a Virtual Block Device (VBD) for Xen is fairly straightforward: in essence, the typical configuration matches that of a DRBD resource being used for any other purpose. However, if you want to enable live migration for your guest instance, you need to enable dual-primary mode for this resource:
resource <resource> {
net {
allow-two-primaries yes;
...
}
...
}
Enabling dual-primary mode is necessary because Xen, before initiating live migration, checks on both the source and the destination host for write access to all VBDs that the DomU is configured to use.
13.4. Using DRBD Virtual Block Devices
To use a DRBD resource as the virtual block device, you must add a line like the following to your Xen DomU configuration:
disk = [ 'drbd:<resource>,xvda,w' ]
This example configuration makes the DRBD resource named resource available to the DomU as /dev/xvda in read/write mode (w).
Of course, you may use multiple DRBD resources with a single DomU. In that case, simply add more entries like the one provided in the example to the disk option, separated by commas.
There are three sets of circumstances under which you cannot use this approach:
-
You are configuring a fully virtual (HVM) DomU.
-
You are installing your DomU using a graphical installation utility, and that graphical installer does not support the drbd: syntax.
-
You are configuring a DomU without the kernel, initrd, and extra options, relying instead on bootloader and bootloader_args to use a Xen pseudo-bootloader, and that pseudo-bootloader does not support the drbd: syntax.
  -
  pygrub (prior to Xen 3.3) and domUloader.py (shipped with Xen on SUSE Linux Enterprise Server 10) are two examples of pseudo-bootloaders that do not support the drbd: virtual block device configuration syntax.
  -
  pygrub from Xen 3.3 forward, and the domUloader.py version that ships with SLES 11, do support this syntax.
Under these circumstances, you must use the traditional phy: device syntax and the DRBD device name that is associated with your resource, not the resource name. That, however, requires that you manage DRBD state transitions outside Xen, which is a less flexible approach than that provided by the drbd resource type.
13.5. Starting, Stopping, and Migrating DRBD-backed DomUs
Once you have configured your DRBD-backed DomU, you may start it as you would any other DomU:
# xm create <domU>
Using config file "/etc/xen/<domU>".
Started domain <domU>
In the process, the DRBD resource you configured as the VBD will be promoted to the primary role, and made accessible to Xen as expected.
Stopping the DomU is equally straightforward:
# xm shutdown -w <domU>
Domain <domU> terminated.
Again, as you would expect, the DRBD resource is returned to the secondary role after the DomU is successfully shut down.
Migrating the DomU, too, is done using the usual Xen tools:
# xm migrate --live <domU> <destination-host>
In this case, several administrative steps are automatically taken in rapid succession:
-
The resource is promoted to the primary role on destination-host.
-
Live migration of DomU is initiated on the local host.
-
When migration to the destination host has completed, the resource is demoted to the secondary role locally.
The fact that the resource must briefly run in the primary role on both hosts is the reason for having to configure the resource in dual-primary mode in the first place.
13.6. Internals of DRBD/Xen Integration
Xen supports two Virtual Block Device types natively:
phy
This device type is used to hand “physical” block devices, available in the host environment, off to a guest DomU in an essentially transparent fashion.
file
This device type is used to make file-based block device images available to the guest DomU. It works by creating a loop block device from the original image file, and then handing that block device off to the DomU in much the same fashion as the phy device type does.
If a Virtual Block Device configured in the disk option of a DomU configuration uses any prefix other than phy:, file:, or no prefix at all (in which case Xen defaults to using the phy device type), Xen expects to find a helper script named block-prefix in the Xen scripts directory, commonly /etc/xen/scripts.
The DRBD distribution provides such a script for the drbd device type, named /etc/xen/scripts/block-drbd. This script handles the necessary DRBD resource state transitions as described earlier in this chapter.
13.7. Integrating Xen with Pacemaker
To fully capitalize on the benefits provided by having DRBD-backed Xen VBDs, it is recommended to have Pacemaker manage the associated DomUs as Pacemaker resources.
You may configure a Xen DomU as a Pacemaker resource, and automate resource failover. To do so, use the Xen OCF resource agent. If you are using the drbd Xen device type described in this chapter, you will not need to configure any separate drbd resource for use by the Xen cluster resource. Instead, the block-drbd helper script will do all the necessary resource transitions for you.
Optimizing DRBD Performance
14. Measuring Block Device Performance
14.1. Measuring Throughput
When measuring the impact of using DRBD on a system’s I/O throughput, the absolute throughput the system is capable of is of little relevance. What is much more interesting is the relative impact DRBD has on I/O performance. Therefore, it is always necessary to measure I/O throughput both with and without DRBD.
The tests described in this section are intrusive; they overwrite data and bring DRBD devices out of sync. It is therefore vital that you perform them only on scratch volumes which can be discarded after testing has completed.
I/O throughput estimation works by writing significantly large chunks of data to a block device, and measuring the amount of time the system took to complete the write operation. This can be easily done using a fairly ubiquitous utility, dd, whose reasonably recent versions include a built-in throughput estimation.
A simple dd-based throughput benchmark, assuming you have a scratch resource named test, which is currently connected and in the secondary role on both nodes, is one like the following:
# TEST_RESOURCE=test
# TEST_DEVICE=$(drbdadm sh-dev $TEST_RESOURCE | head -1)
# TEST_LL_DEVICE=$(drbdadm sh-ll-dev $TEST_RESOURCE | head -1)
# drbdadm primary $TEST_RESOURCE
# for i in $(seq 5); do
dd if=/dev/zero of=$TEST_DEVICE bs=1M count=512 oflag=direct
done
# drbdadm down $TEST_RESOURCE
# for i in $(seq 5); do
dd if=/dev/zero of=$TEST_LL_DEVICE bs=1M count=512 oflag=direct
done
This test simply writes 512MiB of data to your DRBD device, and then to its backing device for comparison. Both tests are repeated 5 times each to allow for some statistical averaging. The relevant result is the throughput measurements generated by dd.
For freshly enabled DRBD devices, it is normal to see slightly reduced performance on the first dd run. This is due to the Activity Log being “cold”, and is no cause for concern.
See our Optimizing DRBD Throughput chapter for some performance numbers.
14.2. Measuring Latency
Latency measurements have objectives completely different from throughput benchmarks: in I/O latency tests, one writes a very small chunk of data (ideally the smallest chunk of data that the system can deal with), and observes the time it takes to complete that write. The process is usually repeated several times to account for normal statistical fluctuations.
Just like throughput measurements, I/O latency measurements may be performed using the ubiquitous dd utility, albeit with different settings and an entirely different focus of observation.
Provided below is a simple dd-based latency micro-benchmark, assuming you have a scratch resource named test which is currently connected and in the secondary role on both nodes:
# TEST_RESOURCE=test
# TEST_DEVICE=$(drbdadm sh-dev $TEST_RESOURCE | head -1)
# TEST_LL_DEVICE=$(drbdadm sh-ll-dev $TEST_RESOURCE | head -1)
# drbdadm primary $TEST_RESOURCE
# dd if=/dev/zero of=$TEST_DEVICE bs=4k count=1000 oflag=direct
# drbdadm down $TEST_RESOURCE
# dd if=/dev/zero of=$TEST_LL_DEVICE bs=4k count=1000 oflag=direct
This test writes 1,000 chunks of 4KiB each to your DRBD device, and then to its backing device for comparison. 4096 bytes is the smallest block size that a Linux system (on all architectures except s390), modern hard disks, and SSDs are expected to handle.
It is important to understand that throughput measurements generated by dd are completely irrelevant for this test; what is important is the time elapsed during the completion of said 1,000 writes. Dividing this time by 1,000 gives the average latency of a single block write.
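The division described above can be sketched as a small shell helper. This is only an illustration: the function name avg_latency_us is hypothetical (it is not part of DRBD or dd), and the elapsed time is assumed to be the number of seconds that dd reports at the end of its run.

```shell
# Hypothetical helper (not part of DRBD or dd): average latency of a
# single write in microseconds.
#   $1 = elapsed time in seconds, as reported by dd
#   $2 = number of writes (the count= value passed to dd)
avg_latency_us() {
    awk -v t="$1" -v n="$2" 'BEGIN { printf "%.1f\n", t * 1000000 / n }'
}

# Example: dd needed 0.5 seconds for count=1000 writes.
avg_latency_us 0.5 1000
```

Running the helper once with the time measured on the DRBD device and once with the time measured on the backing device gives the two values to compare.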
This is the worst case, in that it is single-threaded and does one write strictly after the one before, that is, it runs with an I/O-depth of 1. Please refer to Latency Compared to IOPS.
Furthermore, see our Optimizing DRBD Latency chapter for some typical performance values.
15. Optimizing DRBD Throughput
This chapter deals with optimizing DRBD throughput. It examines some hardware considerations with regard to throughput optimization, and details tuning recommendations for that purpose.
15.1. Hardware Considerations
DRBD throughput is affected by both the bandwidth of the underlying I/O subsystem (disks, controllers, and corresponding caches), and the bandwidth of the replication network.
I/O subsystem throughput is determined, largely, by the number and type of storage units (disks, SSDs, other Flash storage [like FusionIO], …) that can be written to in parallel. A single, reasonably recent, SCSI or SAS disk will typically allow streaming writes of roughly 40MiB/s to the single disk; an SSD will do 300MiB/s; one of the recent Flash storages (NVMe) will be at 1GiB/s. When deployed in a striping configuration, the I/O subsystem will parallelize writes across disks, effectively multiplying a single disk’s throughput by the number of stripes in the configuration. Therefore, the same 40MiB/s disks will allow effective throughput of 120MiB/s in a RAID-0 or RAID-1+0 configuration with three stripes, or 200MiB/s with five stripes; with SSDs, NVMe, or both, you can easily get to 1GiB/sec.
A RAID-controller with RAM and a BBU can speed up short spikes (by buffering them), and so too-short benchmark tests might show speeds like 1GiB/s too; for sustained writes its buffers will just run full, and then not be of much help, though.
Disk mirroring (RAID-1) in hardware typically has little, if any, effect on throughput. Disk striping with parity (RAID-5) does have an effect on throughput, usually an adverse one when compared to striping; RAID-5 and RAID-6 in software even more so.
Network throughput is usually determined by the amount of traffic present on the network, and by the throughput of any routing/switching infrastructure present. These concerns are, however, largely irrelevant in DRBD replication links, which are normally dedicated, back-to-back network connections. Therefore, network throughput may be improved either by switching to higher-throughput hardware (such as 10 Gigabit Ethernet, or 56 Gbit/s InfiniBand), or by using link aggregation over several network links, as one may do using the Linux bonding network driver.
15.2. Estimating DRBD’s Effects on Throughput
When estimating the throughput effects associated with DRBD, it is important to consider the following natural limitations:
-
DRBD throughput is limited by that of the raw I/O subsystem.
-
DRBD throughput is limited by the available network bandwidth.
The lower of these two establishes the theoretical throughput maximum available to DRBD. DRBD’s own additional I/O activity then reduces that baseline maximum by an amount that can be expected to be less than three percent.
-
Consider the example of two cluster nodes containing I/O subsystems capable of 600 MB/s throughput, with a Gigabit Ethernet link available between them. Gigabit Ethernet can be expected to produce 110 MB/s throughput for TCP connections, therefore the network connection would be the bottleneck in this configuration and one would expect about 110 MB/s maximum DRBD throughput.
-
By contrast, if the I/O subsystem is capable of only 80 MB/s for sustained writes, then it constitutes the bottleneck, and you should expect only about 77 MB/s maximum DRBD throughput.
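The estimate above can be written down as a minimal shell sketch. The function name drbd_max_throughput is hypothetical, and the overhead percentage is an assumption that defaults to the three percent upper bound mentioned earlier; real overhead is expected to be lower.

```shell
# Hypothetical helper: estimated maximum DRBD throughput in MB/s.
#   $1 = raw I/O subsystem throughput (MB/s)
#   $2 = replication network throughput (MB/s)
#   $3 = assumed DRBD overhead in percent (optional, defaults to 3)
drbd_max_throughput() {
    awk -v io="$1" -v net="$2" -v ovh="${3:-3}" 'BEGIN {
        base = (io < net) ? io : net    # the slower component is the bottleneck
        printf "%.1f\n", base * (1 - ovh / 100)
    }'
}

# The two scenarios from the text: network-bound, then disk-bound.
drbd_max_throughput 600 110
drbd_max_throughput 80 110
```

Because the sketch applies the full three percent in both cases, the network-bound result comes out slightly below the roughly 110 MB/s quoted in the first example.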
15.3. Tuning Recommendations
DRBD offers several configuration options which may have an effect on your system’s throughput. This section lists some recommendations for tuning for throughput. However, since throughput is largely hardware dependent, the effects of tweaking the options described here may vary greatly from system to system. It is important to understand that these recommendations should not be interpreted as “silver bullets” which would magically remove any and all throughput bottlenecks.
15.3.1. Setting max-buffers and max-epoch-size
These options affect write performance on the secondary nodes. max-buffers is the maximum number of buffers DRBD allocates for writing data to disk, while max-epoch-size is the maximum number of write requests permitted between two write barriers. For these settings to increase performance, max-buffers must be equal to or greater than max-epoch-size.
The default for both is 2048; setting them to around 8000 should be fine for most reasonably high-performance hardware RAID controllers.
resource <resource> {
net {
max-buffers 8000;
max-epoch-size 8000;
...
}
...
}
15.3.2. Tuning the TCP Send Buffer Size
The TCP send buffer is a memory buffer for outgoing TCP traffic. By default, it is set to a size of 128 KiB. For use in high-throughput networks (such as dedicated Gigabit Ethernet or load-balanced bonded connections), it may make sense to increase this to a size of 2MiB, or perhaps even more. Send buffer sizes of more than 16MiB are generally not recommended (and are also unlikely to produce any throughput improvement).
resource <resource> {
net {
sndbuf-size 2M;
...
}
...
}
DRBD also supports TCP send buffer auto-tuning. After enabling this feature, DRBD will dynamically select an appropriate TCP send buffer size. TCP send buffer auto tuning is enabled by simply setting the buffer size to zero:
resource <resource> {
net {
sndbuf-size 0;
...
}
...
}
Please note that your sysctl settings net.ipv4.tcp_rmem and net.ipv4.tcp_wmem will still influence the behavior; you should check these settings, and perhaps set them similar to 131072 1048576 16777216 (minimum 128KiB, default 1MiB, max 16MiB).
net.ipv4.tcp_mem is a different beast, with a different unit. Do not touch it; wrong values can easily push your machine into out-of-memory situations!
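As an illustration of the buffer values suggested above, the following sketch derives the three byte counts from their KiB/MiB sizes and prints the corresponding sysctl commands for review rather than applying them. The function name tcp_buf_triple is hypothetical; whether these values suit your system is something to verify, not assume.

```shell
# Hypothetical helper: print the (min, default, max) triple in bytes
# for 128 KiB, 1 MiB, and 16 MiB.
tcp_buf_triple() {
    echo "$(( 128 * 1024 )) $(( 1024 * 1024 )) $(( 16 * 1024 * 1024 ))"
}

# Print the commands instead of running them, so they can be reviewed first.
# net.ipv4.tcp_mem is deliberately left out.
for key in net.ipv4.tcp_rmem net.ipv4.tcp_wmem; do
    echo "sysctl -w $key=\"$(tcp_buf_triple)\""
done
```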
15.3.3. Tuning the Activity Log Size
If the application using DRBD is write intensive in the sense that it frequently issues small writes scattered across the device, it is usually advisable to use a fairly large activity log. Otherwise, frequent metadata updates may be detrimental to write performance.
resource <resource> {
disk {
al-extents 6007;
...
}
...
}
15.3.4. Disabling Barriers and Disk Flushes
The recommendations outlined in this section should be applied only to systems with non-volatile (battery backed) controller caches.
Systems equipped with battery backed write cache come with built-in means of protecting data in the face of power failure. In that case, it is permissible to disable some of DRBD’s own safeguards created for the same purpose. This may be beneficial in terms of throughput:
resource <resource> {
disk {
disk-barrier no;
disk-flushes no;
...
}
...
}
15.4. Achieving Better Read Performance Through Increased Redundancy
As detailed in the man page of drbd.conf under read-balancing, you can increase your read performance by adding more copies of your data.
As a ballpark figure: with a single node processing read requests, fio on a FusionIO card gave us 100k IOPS; after enabling read-balancing, the performance jumped to 180k IOPS, that is, +80%!
So, in case you’re running a read-mostly workload (big databases with many random reads), it might be worth a try to turn read-balancing on and, perhaps, add another copy for still more read I/O throughput.
16. Optimizing DRBD Latency
This chapter deals with optimizing DRBD latency. It examines some hardware considerations with regard to latency minimization, and details tuning recommendations for that purpose.
16.1. Hardware Considerations
DRBD latency is affected by both the latency of the underlying I/O subsystem (disks, controllers, and corresponding caches), and the latency of the replication network.
For rotating media the I/O subsystem latency is primarily a function of disk rotation speed. Therefore, using fast-spinning disks is a valid approach for reducing I/O subsystem latency.
For solid state media (like SSDs) the Flash storage controller is the determining factor; the next most important thing is the amount of unused capacity. Using DRBD’s Trim and Discard Support will help you provide the controller with the information it needs about which blocks it can recycle. That way, when a write request comes in, it can use a block that was cleaned ahead of time and doesn’t have to wait until space becomes available[11].
Likewise, the use of a battery-backed write cache (BBWC) reduces write completion times, also reducing write latency. Most reasonable storage subsystems come with some form of battery-backed cache, and allow the administrator to configure which portion of this cache is used for read and write operations. The recommended approach is to disable the disk read cache completely and use all available cache memory for the disk write cache.
Network latency is, in essence, the packet round-trip time (RTT) between hosts. It is influenced by several factors, most of which are irrelevant on the dedicated, back-to-back network connections recommended for use as DRBD replication links. Therefore, it is sufficient to accept that a certain amount of latency always exists in network links, which typically is on the order of 100 to 200 microseconds (μs) packet RTT for Gigabit Ethernet.
Network latency may typically be pushed below this limit only by using lower-latency network protocols, such as running DRBD over Dolphin Express using Dolphin SuperSockets, or a 10GbE direct connection; these are typically in the 50µs range. Even better is InfiniBand, which provides even lower latency.
16.2. Estimating DRBD’s Effects on Latency
As for throughput, when estimating the latency effects associated with DRBD, there are some important natural limitations to consider:
-
DRBD latency is bound by that of the raw I/O subsystem.
-
DRBD latency is bound by the available network latency.
The sum of the two establishes the theoretical latency minimum incurred by DRBD[12]. DRBD then adds a slight additional latency of its own, which can be expected to be less than one percent.
-
Consider the example of a local disk subsystem with a write latency of 3ms and a network link with one of 0.2ms. Then the expected DRBD latency would be 3.2 ms or a roughly 7-percent latency increase over just writing to a local disk.
Latency may be influenced by several other factors, including CPU cache misses, context switches, and others.
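The arithmetic of the example above can be sketched as follows. The function name drbd_latency_estimate is hypothetical, and the sketch deliberately ignores DRBD’s own additional latency (expected to be below one percent) as well as the other factors just mentioned.

```shell
# Hypothetical helper: theoretical minimum DRBD write latency.
#   $1 = local disk write latency (ms)
#   $2 = network round-trip latency (ms)
drbd_latency_estimate() {
    awk -v disk="$1" -v net="$2" 'BEGIN {
        printf "%.1f ms total, +%.1f%% over local disk\n", disk + net, net / disk * 100
    }'
}

# The example from the text: 3 ms disk latency, 0.2 ms network latency.
drbd_latency_estimate 3 0.2
```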
16.3. Latency Compared to IOPS
IOPS is the abbreviation of “I/O operations per second“.
Marketing typically doesn’t like numbers that get smaller; press releases aren’t written with “Latency reduced by 10µs, from 50µs to 40µs now!” in mind, they like “Performance increased by 25%, from 20000 to now 25000 IOPS” much more. Therefore IOPS were invented – to get a number that says “higher is better”.
So, IOPS are the reciprocal of latency. The method in Measuring Latency gives you a latency measurement based on the number of IOPS for a purely sequential, single-threaded I/O load. Most other documentation will give measurements for some highly parallel I/O load[13], because this gives much larger numbers.
So, please don’t shy away from measuring serialized, single-threaded latency.
If you want a large IOPS number, run the fio utility with threads=8 and an iodepth=16, or some similar settings… But please remember that these numbers will not have any meaning to your setup, unless you’re driving a database with many tens or hundreds of client connections active at the same time.
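For the serialized, single-threaded case, the reciprocal relationship between latency and IOPS can be sketched in one line of awk. The function name latency_to_iops is hypothetical, and the conversion only holds at an I/O depth of 1; with parallel I/O, IOPS can rise without per-request latency improving.

```shell
# Hypothetical helper: IOPS at an I/O depth of 1.
#   $1 = average latency of a single request in microseconds
latency_to_iops() {
    awk -v us="$1" 'BEGIN { printf "%d\n", 1000000 / us }'
}

# The figures from the press-release example above.
latency_to_iops 50
latency_to_iops 40
```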
16.4. Tuning Recommendations
16.4.1. Setting DRBD’s CPU Mask
DRBD allows you to set an explicit CPU mask for its kernel threads. By default, DRBD picks a single CPU for each resource. All the threads for this resource run on this CPU. This policy is generally optimal when the goal is maximum aggregate performance with more DRBD resources than CPU cores. If instead you want to maximize the performance of individual resources at the cost of total CPU usage, you can use the CPU mask parameter to allow the DRBD threads to use multiple CPUs.
In addition, for detailed fine-tuning, you can coordinate the placement of application threads with the corresponding DRBD threads. Depending on the behavior of the application and the optimization goals, it may be beneficial to either use the same CPU, or to separate the threads onto independent CPUs, that is, restrict DRBD from using the same CPUs that are used by the application.
The CPU mask value that you set in a DRBD resource configuration is a hex number (or else a string of comma-separated hex numbers, to specify a mask that includes a system’s 33rd CPU core or beyond). You can specify a mask that has up to a maximum of 908 CPU cores.
When represented in binary, the least significant bit of the CPU mask represents the first CPU, the second-least significant bit the second CPU, and so forth, up to a maximum of 908 CPU cores. A set bit (1) in the binary representation of the mask means that DRBD can use the corresponding CPU. A cleared bit (0) means that DRBD cannot use the corresponding CPU.
For example, a CPU mask of 0x1 (00000001 in binary) means DRBD can use the first CPU only. A mask of 0xC (00001100 in binary) means that DRBD can use the third and fourth CPU.
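For masks that fit within the first 32 CPUs, the bit arithmetic can also be sketched directly with shell arithmetic. The function name cpus_to_mask is hypothetical; it takes 1-based CPU positions (first CPU = 1) and does not produce the comma-separated format needed for larger masks.

```shell
# Hypothetical helper: build a hex CPU mask from 1-based CPU positions.
# Only intended for the first 32 CPUs.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        # CPU n corresponds to bit n-1, counted from the least significant bit
        mask=$(( mask | (1 << (cpu - 1)) ))
    done
    printf '%X\n' "$mask"
}

cpus_to_mask 1      # first CPU only
cpus_to_mask 3 4    # third and fourth CPU
```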
To convert a binary mask value to the hex value (or string of hex values) needed for your DRBD resource configuration file, you can use the following commands, provided that you have the bc utility installed. For example, to get the hex value for the binary number 00001100 and apply the necessary formatting for the CPU mask value string, enter the following:
$ binmask=00001100
$ echo "obase=16;ibase=2;$binmask" | BC_LINE_LENGTH=0 bc | \
sed ':a;s/\([^,]\)\([^,]\{8\}\)\($\|,\)/\1,\2\3/;p;ta;s/,0\+/,/g' | tail -n 1
The sed command above transforms the resulting hex number (converted from the binary number in the binmask variable) into the string format that the function that parses the cpu-mask string expects.
Output from these commands will be C. You can then use this value in your resource configuration file, as follows, to limit DRBD to only use the third and fourth CPU cores:
resource <resource> {
options {
cpu-mask C;
...
}
...
}
If you need to specify a mask that represents more than 32 CPUs then you will need to use a comma separated list of 32 bit hex values[14], up to a maximum of 908 CPU cores. A comma must separate every group of eight hex digits (32 binary digits) in the string.
For a contrived, more complex example, if you wanted to restrict DRBD to using just the 908th, 35th, 34th, 5th, 2nd, and 1st CPUs, you would set your CPU mask as follows:
$ binmask=10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011000000000000000000000000000010011 $ echo "obase=16;ibase=2;$binmask" | BC_LINE_LENGTH=0 bc | \ sed ':a;s/\([^,]\)\([^,]\{8\}\)\($\|,\)/\1,\2\3/;p;ta;s/,0\+/,/g' | tail -n 1
Output from this command will be:
$ 800,,,,,,,,,,,,,,,,,,,,,,,,,,,6,13
You would then set the CPU mask parameter in your resource configuration to:
cpu-mask 800,,,,,,,,,,,,,,,,,,,,,,,,,,,6,13
Of course, to minimize CPU competition between DRBD and the application using it, you need to configure your application to use only those CPUs which DRBD does not use. Some applications might provide for this through an entry in a configuration
file, just like DRBD itself. Others include an invocation of the
|
It makes sense to keep the DRBD threads running on the same L2/L3 caches. However, the numbering of CPUs doesn’t have to correlate with the physical partitioning.
You can try the |
16.4.2. Modifying the Network MTU
It may be beneficial to change the replication network’s maximum transmission unit (MTU) size to a value higher than the default of 1500 bytes. Colloquially, this is referred to as “enabling Jumbo frames”.
The MTU may be changed using the following commands:
# ifconfig <interface> mtu <size>
or
# ip link set <interface> mtu <size>
<interface> refers to the network interface used for DRBD replication. A typical value for <size> would be 9000 (bytes).
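One way to check that jumbo frames actually pass end to end is to send a ping that forbids fragmentation and exactly fills one frame. The sketch below only computes the payload size and prints the command. It assumes IPv4 (20-byte IP header plus 8-byte ICMP header) and the -M do option of the Linux iputils ping; the function name icmp_payload and the <peer-address> placeholder are illustrative.

```shell
# Hypothetical helper: largest ICMP payload that fits into one frame,
# assuming IPv4: MTU minus 20-byte IP header minus 8-byte ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Print (rather than run) a ping that prohibits fragmentation.
# <peer-address> is a placeholder for the replication peer.
echo "ping -M do -c 3 -s $(icmp_payload 9000) <peer-address>"
```

If such a ping fails while a smaller payload succeeds, some device along the replication path is probably still using a smaller MTU.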
16.4.3. Enabling the Deadline I/O Scheduler
When used in conjunction with high-performance, write-back-enabled hardware RAID controllers, DRBD latency may benefit greatly from using the simple deadline I/O scheduler, rather than the CFQ scheduler. The latter is typically enabled by default.
Modifications to the I/O scheduler configuration may be performed through the sysfs virtual file system, mounted at /sys. The scheduler configuration is in /sys/block/<device>, where <device> is the backing device DRBD uses.
You can enable the deadline scheduler with the following command:
# echo deadline > /sys/block/<device>/queue/scheduler
You may then also set the following values, which may provide additional latency benefits:
-
Disable front merges:
# echo 0 > /sys/block/<device>/queue/iosched/front_merges
-
Reduce read I/O deadline to 150 milliseconds (the default is 500ms):
# echo 150 > /sys/block/<device>/queue/iosched/read_expire
-
Reduce write I/O deadline to 1500 milliseconds (the default is 3000ms):
# echo 1500 > /sys/block/<device>/queue/iosched/write_expire
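To confirm which scheduler is active after making these changes, you can read the scheduler file back; the kernel marks the active entry with square brackets. The helper below is a hypothetical sketch that extracts that entry, demonstrated on a sample string rather than on a real device.

```shell
# Hypothetical helper: print the active I/O scheduler, that is, the
# entry the kernel encloses in square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Demonstration on sample content of a queue/scheduler file.
echo "noop [deadline] cfq" | active_scheduler

# On a real system (<device> is a placeholder for the backing device):
#   active_scheduler < /sys/block/<device>/queue/scheduler
```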
If these values effect a significant latency improvement, you may want to make them permanent so they are automatically set at system startup. Debian and Ubuntu systems provide this functionality through the sysfsutils package and the /etc/sysfs.conf configuration file.
You may also make a global I/O scheduler selection by passing the elevator parameter through your kernel command line. To do so, edit your boot loader configuration (normally found in /etc/default/grub if you are using the GRUB boot loader) and add elevator=deadline to your list of kernel boot options.
Learning More
17. DRBD Internals
This chapter gives some background information about some of DRBD’s internal algorithms and structures. It is intended for interested users wishing to gain a certain degree of background knowledge about DRBD. It does not dive into DRBD’s inner workings deep enough to be a reference for DRBD developers. For that purpose, please refer to the papers listed in Publications, and of course to the comments in the DRBD source code.
17.1. DRBD Metadata
DRBD stores various pieces of information about the data it keeps in a dedicated area. This metadata includes:
-
the size of the DRBD device,
-
the Generation Identifier (GI, described in detail in Generation Identifiers),
-
the Activity Log (AL, described in detail in The Activity Log),
-
the quick-sync bitmap (described in detail in The Quick-sync Bitmap).
This metadata may be stored internally or externally. Which method is used is configurable on a per-resource basis.
17.1.1. Internal Metadata
Configuring a resource to use internal metadata means that DRBD stores its metadata on the same physical lower-level device as the actual production data. It does so by setting aside an area at the end of the device for the specific purpose of storing metadata.
Since the metadata are inextricably linked with the actual data, no special action is required from the administrator in case of a hard disk failure. The metadata are lost together with the actual data and are also restored together.
In case of the lower-level device being a single physical hard disk (as opposed to a RAID set), internal metadata may negatively affect write throughput. Write requests issued by the application may trigger an update of the metadata in DRBD. If the metadata are stored on the same magnetic disk as the data, the write operation may result in two additional movements of the hard disk’s write/read head.
If you are planning to use internal metadata in conjunction with an existing lower-level device that already has data that you want to preserve, you must account for the space required by DRBD’s metadata. Otherwise, upon DRBD resource creation, the newly created metadata would overwrite data at the end of the lower-level device, potentially destroying existing files in the process.
To avoid that, you must do one of the following things:
-
Enlarge your lower-level device. This is possible with any logical volume management facility (such as LVM) provided that you have free space available in the corresponding volume group. It may also be supported by hardware storage solutions.
-
Shrink your existing file system on your lower-level device. This may or may not be supported by your file system.
-
If neither of the two is possible, use external metadata instead.
To estimate the amount by which you must enlarge your lower-level device or shrink your file system, see Estimating Metadata Size.
17.1.2. External Metadata
External metadata is simply stored on a separate, dedicated block device distinct from that which holds your production data.
For some write operations, using external metadata produces a somewhat improved latency behavior.
With external metadata, the metadata are not inextricably linked with the actual production data. This means that manual intervention is required in the case of a hardware failure destroying just the production data (but not DRBD metadata), to effect a full data sync from the surviving node onto the subsequently replaced disk.
Use of external metadata is also the only viable option if all of the following apply:
-
You are using DRBD to duplicate an existing device that already contains data you want to preserve, and
-
that existing device does not support enlargement, and
-
the existing file system on the device does not support shrinking.
To estimate the required size of the block device dedicated to hold your device metadata, see Estimating Metadata Size.
External metadata requires a device size of at least 1MB.
17.1.3. Estimating Metadata Size
You may calculate the exact space requirements for DRBD’s metadata using the following formula:
Ms = ⌈Cs / 2^18⌉ × 8 × N + 72
Cs is the data device size in sectors, N is the number of peers, and Ms is the resulting metadata size, also in sectors.
You may retrieve the device size (in bytes) by issuing blockdev --getsize64 <device>; to convert to MB, divide by 1048576 (= 2^20 or 1024^2).
In practice, you may use a reasonably good approximation, given below. Note that in this formula, the unit is megabytes, not sectors:
M_MB < (C_MB / 32768) × N + 1
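The two estimates can be tried out with a short script. This is a minimal sketch, assuming the exact formula Ms = ⌈Cs / 2^18⌉ × 8 × N + 72 (in sectors) and the approximation M_MB < (C_MB / 32768) × N + 1, and assuming 512-byte sectors. The function names are illustrative only and are not part of any DRBD tool.

```python
SECTOR_BYTES = 512  # assumption: sizes are counted in 512-byte sectors


def exact_metadata_sectors(device_bytes: int, peers: int) -> int:
    """Exact metadata size in sectors: ceil(Cs / 2^18) * 8 * N + 72."""
    cs = device_bytes // SECTOR_BYTES
    chunks = -(-cs // 2**18)  # integer ceiling division
    return chunks * 8 * peers + 72


def approx_metadata_mb(device_bytes: int, peers: int) -> float:
    """Upper-bound estimate in MB: C_MB / 32768 * N + 1."""
    c_mb = device_bytes / 1048576
    return c_mb / 32768 * peers + 1


if __name__ == "__main__":
    # Feed in the value printed by: blockdev --getsize64 <device>
    size = 1024**4  # a 1 TiB device, as an example
    for peers in (1, 3):
        sectors = exact_metadata_sectors(size, peers)
        print(peers, sectors, sectors * SECTOR_BYTES / 1048576,
              approx_metadata_mb(size, peers))
```

For a 1 TiB device and one peer, the exact formula yields 65608 sectors (about 32 MB), which stays below the approximation of 33 MB.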
17.2. Generation Identifiers
DRBD uses generation identifiers (GIs) to identify “generations” of replicated data.
This is DRBD’s internal mechanism used for
-
determining whether the two nodes are in fact members of the same cluster (as opposed to two nodes that were connected accidentally),
-
determining the direction of background re-synchronization (if necessary),
-
determining whether full re-synchronization is necessary or whether partial re-synchronization is sufficient,
-
identifying split brain.
17.2.1. Data Generations
DRBD marks the start of a new data generation at each of the following occurrences:
-
The initial device full sync,
-
a disconnected resource switching to the primary role,
-
a resource in the primary role disconnecting.
Therefore, we can summarize that whenever a resource is in the Connected connection state, and both nodes’ disk state is UpToDate, the current data generation on both nodes is the same. The inverse is also true. Note that the current implementation uses the lowest bit to encode the role of the node (Primary/Secondary). Therefore, the lowest bit might be different on distinct nodes even if they are considered to have the same data generation.
Every new data generation is identified by an 8-byte, universally unique identifier (UUID).
17.2.2. The Generation Identifier Tuple
DRBD keeps some pieces of information about current and historical data generations in the local resource metadata:
Current UUID: This is the generation identifier for the current data generation, as seen from the local node’s perspective. When a resource is Connected and fully synchronized, the current UUID is identical between nodes.
Bitmap UUIDs: This is the UUID of the generation against which this on-disk bitmap is tracking changes (per remote host). Like the on-disk sync bitmap itself, this identifier is only relevant while the remote host is disconnected.
Historical UUIDs: These are the identifiers of data generations preceding the current one, sized to have one slot per (possible) remote host.
Collectively, these items are referred to as the generation identifier tuple, or “GI tuple” for short.
17.2.3. How Generation Identifiers Change
Start of a New Data Generation
When a node in Primary role loses connection to its peer (either by network failure or manual intervention), DRBD modifies its local generation identifiers in the following manner:
-
The primary creates a new UUID for the new data generation. This becomes the new current UUID for the primary node.
-
The previous current UUID now refers to the generation the bitmap is tracking changes against, so it becomes the new bitmap UUID for the primary node.
-
On the secondary node(s), the GI tuple remains unchanged.
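The rotation described above can be illustrated with a small model. This is a sketch for a single peer only: the GITuple class, its field names, and the use of 0 for an empty slot are assumptions made for illustration and do not mirror DRBD’s on-disk layout. The historical UUIDs are left untouched because the steps above do not mention them.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class GITuple:
    current: int = 0                              # 0 stands for "empty"
    bitmap: int = 0                               # one peer only in this sketch
    history: list = field(default_factory=list)


def random_uuid() -> int:
    """Return a non-zero random 64-bit (8-byte) identifier."""
    value = 0
    while value == 0:
        value = secrets.randbits(64)
    return value


def start_new_generation(gi: GITuple, new_uuid=random_uuid) -> GITuple:
    """Model of a Primary losing its peer connection."""
    return GITuple(
        current=new_uuid(),         # a fresh UUID becomes the current UUID
        bitmap=gi.current,          # the previous current UUID becomes the bitmap UUID
        history=list(gi.history),   # not covered by the steps above, kept as-is
    )


if __name__ == "__main__":
    before = GITuple(current=0xA0)
    print(start_new_generation(before))
```

The secondary side needs no counterpart function because its GI tuple remains unchanged.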
17.2.4. How DRBD Uses Generation Identifiers
When a connection between nodes is established, the two nodes exchange their currently available generation identifiers, and proceed accordingly. Several possible outcomes exist:
The local node detects that both its current UUID and the peer’s current UUID are empty. This is the normal occurrence for a freshly configured resource that has not had the initial full sync initiated. No synchronization takes place; it has to be started manually.
The local node detects that the peer’s current UUID is empty, and its own is not. This is the normal case for a freshly configured resource on which the initial full sync has just been initiated, the local node having been selected as the initial synchronization source. DRBD now sets all bits in the on-disk sync bitmap (meaning it considers the entire device out-of-sync), and starts synchronizing as a synchronization source. In the opposite case (local current UUID empty, peer’s non-empty), DRBD performs the same steps, except that the local node becomes the synchronization target.
The local node detects that its current UUID and the peer’s current UUID are non-empty and equal. This is the normal occurrence for a resource that went into disconnected mode at a time when it was in the secondary role, and was not promoted on either node while disconnected. No synchronization takes place, as none is necessary.
The local node detects that its bitmap UUID matches the peer’s current UUID, and that the peer’s bitmap UUID is empty. This is the normal and expected occurrence after a secondary node failure, with the local node being in the primary role. It means that the peer never became primary in the meantime and worked on the basis of the same data generation all along. DRBD now initiates a normal, background re-synchronization, with the local node becoming the synchronization source. If, conversely, the local node detects that its bitmap UUID is empty, and that the peer’s bitmap matches the local node’s current UUID, then that is the normal and expected occurrence after a failure of the local node. Again, DRBD now initiates a normal, background re-synchronization, with the local node becoming the synchronization target.
The local node detects that its current UUID matches one of the peer’s historical UUIDs. This implies that while the two data sets share a common ancestor, and the peer node has the up-to-date data, the information kept in the peer node’s bitmap is outdated and not usable. Therefore, a normal synchronization would be insufficient. DRBD now marks the entire device as out-of-sync and initiates a full background re-synchronization, with the local node becoming the synchronization target. In the opposite case (one of the local node’s historical UUID matches the peer’s current UUID), DRBD performs the same steps, except that the local node becomes the synchronization source.
The local node detects that its current UUID differs from the peer’s current UUID, and that the bitmap UUIDs match. This is split brain, but one where the data generations have the same parent. This means that DRBD invokes split brain auto-recovery strategies, if configured. Otherwise, DRBD disconnects and waits for manual split brain resolution.
The local node detects that its current UUID differs from the peer’s current UUID, and that the bitmap UUIDs do not match. This is split brain with unrelated ancestor generations, therefore auto-recovery strategies, even if configured, are moot. DRBD disconnects and waits for manual split brain resolution.
Finally, in case DRBD fails to detect even a single matching element in the two nodes’ GI tuples, it logs a warning about unrelated data and disconnects. This is DRBD’s safeguard against accidental connection of two cluster nodes that have never heard of each other before.
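The outcomes listed above can be read as a decision procedure. The following is a simplified sketch for two nodes, not DRBD’s actual implementation: the GITuple class, the Outcome names, and the convention that 0 means an empty UUID are all assumptions, and the role bit in the lowest bit of each UUID is ignored.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    NO_SYNC_MANUAL_START = auto()       # both current UUIDs empty
    NO_SYNC_NEEDED = auto()             # current UUIDs equal
    FULL_SYNC_SOURCE = auto()
    FULL_SYNC_TARGET = auto()
    BITMAP_SYNC_SOURCE = auto()
    BITMAP_SYNC_TARGET = auto()
    SPLIT_BRAIN_COMMON_PARENT = auto()  # auto-recovery may apply
    SPLIT_BRAIN_UNRELATED = auto()      # manual resolution only
    UNRELATED_DATA = auto()             # nothing matches at all


@dataclass
class GITuple:
    current: int = 0
    bitmap: int = 0
    history: list = field(default_factory=list)


def decide(local: GITuple, peer: GITuple) -> Outcome:
    """Walk through the cases in the order the text lists them."""
    if not local.current and not peer.current:
        return Outcome.NO_SYNC_MANUAL_START
    if not peer.current:
        return Outcome.FULL_SYNC_SOURCE
    if not local.current:
        return Outcome.FULL_SYNC_TARGET
    if local.current == peer.current:
        return Outcome.NO_SYNC_NEEDED
    if local.bitmap == peer.current and not peer.bitmap:
        return Outcome.BITMAP_SYNC_SOURCE
    if not local.bitmap and peer.bitmap == local.current:
        return Outcome.BITMAP_SYNC_TARGET
    if local.current in peer.history:
        return Outcome.FULL_SYNC_TARGET
    if peer.current in local.history:
        return Outcome.FULL_SYNC_SOURCE
    if local.bitmap and local.bitmap == peer.bitmap:
        return Outcome.SPLIT_BRAIN_COMMON_PARENT
    local_ids = {local.current, local.bitmap, *local.history} - {0}
    peer_ids = {peer.current, peer.bitmap, *peer.history} - {0}
    if local_ids.isdisjoint(peer_ids):
        return Outcome.UNRELATED_DATA
    return Outcome.SPLIT_BRAIN_UNRELATED


if __name__ == "__main__":
    primary = GITuple(current=0xA0, bitmap=0x90)
    returning_secondary = GITuple(current=0x90)
    print(decide(primary, returning_secondary))
```

The usage example models a secondary that comes back after a failure while the primary kept writing: the primary’s bitmap UUID equals the peer’s current UUID, so a bitmap-based re-synchronization with the primary as source is selected.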
17.3. The Activity Log
17.3.1. Purpose
During a write operation DRBD forwards the write operation to the local backing block device, but also sends the data block over the network. These two actions occur, for all practical purposes, simultaneously. Random timing behavior may cause a situation where the write operation has been completed, but the transmission over the network has not yet taken place, or vice versa.
If, at this moment, the active node fails and failover is being initiated, then this data block is out of sync between nodes: it has been written on the failed node prior to the failure, but replication has not yet completed. Therefore, when the node eventually recovers, this block must be removed from the data set during subsequent synchronization. Otherwise, the failed node would be “one write ahead” of the surviving node, which would violate the “all or nothing” principle of replicated storage. This issue is not limited to DRBD; in fact, it exists in practically all replicated storage configurations. Many other storage solutions (just as DRBD itself, prior to version 0.7) therefore require that after a failure of the active node, the data be fully synchronized after its recovery.
DRBD’s approach, since version 0.7, is a different one. The activity log (AL), stored in the metadata area, keeps track of those blocks that have “recently” been written to. Colloquially, these areas are referred to as hot extents.
If a temporarily failed node that was in active mode at the time of failure is synchronized, only those hot extents highlighted in the AL need to be synchronized (plus any blocks marked in the bitmap on the now-active peer), rather than the full device. This drastically reduces synchronization time after an active node failure.
17.3.2. Active Extents
The activity log has a configurable parameter, the number of active extents. Every active extent adds 4MiB to the amount of data being retransmitted after a Primary failure. This parameter must be understood as a compromise between the following opposites:
Keeping a large activity log improves write throughput. Every time a new extent is activated, an old extent is reset to inactive. This change requires a write operation to the metadata area. If the number of active extents is high, old active extents are swapped out fairly rarely, reducing metadata write operations and thereby improving performance.
Keeping a small activity log reduces synchronization time after active node failure and subsequent recovery.
17.3.3. Selecting a Suitable Activity Log Size
Consideration of the number of extents should be based on the desired synchronization time at a given synchronization rate. The number of active extents can be calculated as follows:
E = (R × tsync) / 4
R is the synchronization rate, given in MiB/s. tsync is the target synchronization time, in seconds. E is the resulting number of active extents. The divisor 4 is the size of one active extent in MiB.
To provide an example, suppose the cluster has an I/O subsystem with a throughput rate of 200 MiByte/s that was configured to a synchronization rate (R) of 60 MiByte/s, and we want to keep the target synchronization time (tsync) at 4 minutes or 240 seconds:
E = (60 × 240) / 4 = 3600
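The same arithmetic can be expressed as a small helper. This is a sketch that assumes the relation E = R × tsync / 4, with 4 MiB per active extent as stated earlier in this section. The function names are illustrative.

```python
EXTENT_MIB = 4  # each active extent covers 4 MiB


def active_extents(sync_rate_mib_s: float, target_sync_s: float) -> int:
    """Number of active extents for a target resync time: E = R * t / 4."""
    return int(sync_rate_mib_s * target_sync_s // EXTENT_MIB)


def worst_case_resync_seconds(extents: int, sync_rate_mib_s: float) -> float:
    """Inverse view: time to retransmit every active extent after a Primary failure."""
    return extents * EXTENT_MIB / sync_rate_mib_s


if __name__ == "__main__":
    e = active_extents(60, 240)  # R = 60 MiB/s, tsync = 240 s
    print(e, worst_case_resync_seconds(e, 60))
```

The inverse function is useful for checking an existing configuration: it shows how long the activity log portion of a resync may take at the configured rate.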
On a final note, DRBD 9 needs to keep an AL even on the Secondary nodes, as their data might be used to synchronize other Secondary nodes.
17.4. The Quick-sync Bitmap
The quick-sync bitmap is the internal data structure which DRBD uses, on a per-resource per-peer basis, to keep track of blocks being in sync (identical on both nodes) or out-of-sync. It is only relevant when a resource is in disconnected mode.
In the quick-sync bitmap, one bit represents a 4-KiB chunk of on-disk data. If the bit is cleared, it means that the corresponding block is still in sync with the peer node. That implies that the block has not been written to since the time of disconnection. Conversely, if the bit is set, it means that the block has been modified and needs to be re-synchronized whenever the connection becomes available again.
As DRBD detects write I/O on a disconnected device, and therefore starts setting bits in the quick-sync bitmap, it does so in RAM — therefore avoiding expensive synchronous metadata I/O operations. Only when the corresponding blocks turn cold (that is, expire from the Activity Log), DRBD makes the appropriate modifications in an on-disk representation of the quick-sync bitmap. Likewise, if the resource happens to be manually shut down on the remaining node while disconnected, DRBD flushes the complete quick-sync bitmap out to persistent storage.
When the peer node recovers or the connection is re-established, DRBD combines the bitmap information from both nodes to determine the total data set that it must re-synchronize. Simultaneously, DRBD examines the generation identifiers to determine the direction of synchronization.
The node acting as the synchronization source then transmits the agreed-upon blocks to the peer node, clearing sync bits in the bitmap as the synchronization target acknowledges the modifications. If the re-synchronization is now interrupted (by another network outage, for example) and subsequently resumed, it will continue where it left off, with any additional blocks modified in the meantime being added to the re-synchronization data set.
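The bookkeeping described in this section can be modeled in a few lines. This is a toy sketch, assuming one bit per 4-KiB block and a plain bitwise OR to combine the two nodes’ bitmaps. The class and function names are hypothetical, and the model ignores the activity log and the on-disk representation entirely.

```python
BLOCK_BYTES = 4096  # one bit represents a 4-KiB chunk


class QuickSyncBitmap:
    """In-memory bitmap; a set bit means the block is out of sync."""

    def __init__(self) -> None:
        self.bits = 0  # a Python int used as an arbitrarily long bit set

    def mark_written(self, offset: int, length: int) -> None:
        """Set the bits of all blocks touched by a write while disconnected."""
        if length <= 0:
            return
        first = offset // BLOCK_BYTES
        last = (offset + length - 1) // BLOCK_BYTES
        for block in range(first, last + 1):
            self.bits |= 1 << block

    def clear(self, block: int) -> None:
        """Clear a bit once the sync target has acknowledged the block."""
        self.bits &= ~(1 << block)

    def out_of_sync_blocks(self) -> list:
        return [i for i in range(self.bits.bit_length()) if self.bits >> i & 1]


def combined_resync_set(local: QuickSyncBitmap, peer: QuickSyncBitmap) -> QuickSyncBitmap:
    """Union of both nodes' bitmaps: the total set of blocks to re-synchronize."""
    merged = QuickSyncBitmap()
    merged.bits = local.bits | peer.bits
    return merged


if __name__ == "__main__":
    a, b = QuickSyncBitmap(), QuickSyncBitmap()
    a.mark_written(4095, 2)     # straddles blocks 0 and 1
    b.mark_written(8192, 4096)  # exactly block 2
    print(combined_resync_set(a, b).out_of_sync_blocks())
```

Because bits are cleared one block at a time as acknowledgements arrive, an interrupted run in this model also resumes with only the remaining set bits, which matches the behavior described above.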
Re-synchronization may also be paused and resumed manually with the drbdadm pause-sync and drbdadm resume-sync commands. You should, however, not do so light-heartedly: interrupting re-synchronization leaves your secondary node’s disk Inconsistent longer than necessary.
17.5. The Peer-fencing Interface
DRBD has an interface defined for fencing[15] the peer node in case of the replication link being interrupted. The fence-peer handler should mark the disk(s) on the peer node as Outdated, or shut down the peer node. It has to fulfill these tasks under the assumption that the replication network is down.
The fencing helper is invoked only in case
-
a fence-peer handler has been defined in the resource’s (or common) handlers section, and
-
the fencing option for the resource is set to either resource-only or resource-and-stonith, and
-
the node was primary and the replication link is interrupted long enough for DRBD[16] to detect a network failure, or
-
the node should promote to primary and is not connected to the peer and the peer’s disks are not already marked as Outdated.
The program or script specified as the fence-peer handler, when it is invoked, has the DRBD_RESOURCE and DRBD_PEER environment variables available. They contain the name of the affected DRBD resource and the peer’s hostname, respectively.
Any peer fencing helper program (or script) must return one of the following exit codes:
Exit code | Implication |
---|---|
3 | Peer’s disk state was already Inconsistent. |
4 | Peer’s disk state was successfully set to Outdated (or was Outdated to begin with). |
5 | Connection to the peer node failed, peer could not be reached. |
6 | Peer refused to be outdated because the affected resource was in the primary role. |
7 | Peer node was successfully fenced off the cluster. This should never occur unless fencing is set to resource-and-stonith for the affected resource. |
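A custom handler only has to honor this contract: read the two environment variables and exit with one of the listed codes. The skeleton below is a hypothetical sketch, not a shipped DRBD helper. The try_outdate_peer placeholder stands in for whatever out-of-band mechanism your cluster provides, and returning 5 when the environment variables are missing is an assumption made here for safety.

```python
#!/usr/bin/env python3
import os
import sys
from enum import IntEnum


class FenceExit(IntEnum):
    PEER_INCONSISTENT = 3   # peer's disk state was already Inconsistent
    PEER_OUTDATED = 4       # peer's disk is (now) Outdated
    PEER_UNREACHABLE = 5    # connection to the peer failed
    PEER_IS_PRIMARY = 6     # peer refused because it is Primary
    PEER_FENCED = 7         # peer was fenced off the cluster


def try_outdate_peer(resource: str, peer: str) -> FenceExit:
    """Hypothetical placeholder: contact the peer over a path other than the
    replication link and ask it to outdate its disk. Always reports failure."""
    return FenceExit.PEER_UNREACHABLE


def main(environ=os.environ, action=try_outdate_peer) -> int:
    resource = environ.get("DRBD_RESOURCE")
    peer = environ.get("DRBD_PEER")
    if not resource or not peer:
        # Assumption: without both variables, treat the peer as unreachable.
        return int(FenceExit.PEER_UNREACHABLE)
    return int(action(resource, peer))


if __name__ == "__main__":
    sys.exit(main())
```

Passing the environment and the action as parameters keeps the exit-code logic testable without touching a real cluster.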
17.6. The Client Mode
Since version 9.0.13 DRBD supports clients. A client in DRBD speak is just a permanently diskless node. In the configuration, it is expressed by using the keyword none for the backing block device (the disk keyword). In the drbdsetup status output, the Diskless disk status is then displayed in green. (Usually, a disk state of Diskless is displayed in red.)
Internally, all the peers of an intentionally diskless node are configured with the peer-device-option --bitmap=no. That means that they will not allocate a bitmap slot in the metadata for the intentionally diskless peer. On the intentionally diskless node, the device gets marked with the option --diskless=yes when it is created with the new-minor sub-command of drbdsetup.
These flags are visible through the events2 status command:
-
a device might have the client: field. If it reports yes, the local device was marked to be permanently diskless.
-
a peer-device might have the peer-client field. If it says yes, then there is no change-tracking bitmap to that peer.
Relevant commands and implications:
-
You can only run drbdsetup peer-device-options --bitmap=yes … if bitmap slots are available in the metadata, since a bitmap slot gets allocated.
-
The command drbdsetup peer-device-options --bitmap=no … is only possible if the peer is diskless; it does not deallocate the bitmap slot.
-
drbdsetup forget-peer … is used to irrevocably free the bitmap slot assigned to a certain peer.
-
Connecting two peers with disk where one (or both) expect the peer to be permanently diskless fails.
18. Getting More Information
18.1. Commercial DRBD Support
Commercial DRBD support, consultation, and training services are available from the project’s sponsor company, LINBIT.
18.2. Public Mailing List
The public mailing list for general usage questions regarding DRBD is drbd-user@lists.linbit.com. This is a subscribers-only mailing list; you may subscribe at https://lists.linbit.com/listinfo/drbd-user/. A complete list archive is available at https://lists.linbit.com/pipermail/drbd-user/.
18.3. Official Twitter Account
LINBIT maintains an official Twitter account.
If you tweet about DRBD, please include the #drbd hashtag.
18.4. Publications
DRBD’s authors have written and published several papers on DRBD in general, or a specific aspect of DRBD. Here is a short selection:
-
Philipp Reisner. DRBD 9 – What’s New, 2012.
-
Lars Ellenberg. DRBD v8.0.x and beyond, 2007.
-
Philipp Reisner. DRBD v8 – Replicated Storage with Shared Disk Semantics, 2007.
-
Philipp Reisner. Rapid resynchronization for replicated storage, 2006.
18.5. Other Useful Resources
-
In addition to the user’s guide you are reading now, LINBIT provides more documentation such as how-to guides, video tutorials, and explanatory blog posts that are available through the LINBIT website:
-
User’s guides: https://linbit.com/user-guides-and-product-documentation
-
How-to guides: https://linbit.com/solutions-and-how-to-documentation
-
Blog: https://linbit.com/blog
-
LINBIT offers an official training course for DRBD: DRBD Basics.
-
Wikipedia has an entry on DRBD.
-
The ClusterLabs website has useful information about using DRBD in high-availability clusters.
Appendices
Appendix A: Recent Changes
This appendix is for users who upgrade from earlier DRBD versions to DRBD 9.x. It highlights some important changes to DRBD’s configuration and behavior.
A.1. DRBD 9.2 Changelog
You can find an itemized list of updates, fixes, and changes to the DRBD 9.2 branch at the project’s open source codebase repository: https://github.com/LINBIT/drbd/blob/master/ChangeLog
Some of the highlights include:
-
Add RDMA transport.
-
Allow resync to proceed even with continuous application I/O.
-
Process control socket packets directly in “bottom half” context. This improves performance by decreasing latency.
-
Perform more discards when resyncing. Resync in multiples of discard granularity.
-
Support network namespaces, for better integration with containers and orchestrators such as Kubernetes.
A.2. DRBD 9.1 Changelog
You can find an itemized list of updates, fixes, and changes to the DRBD 9.1 branch at the project’s open source codebase repository: https://github.com/LINBIT/drbd/blob/drbd-9.1/ChangeLog
Some of the highlights include:
-
Reduce locking contention in sending path. This increases performance of workloads with multiple peers or high I/O depth.
-
Improve support for various scenarios involving suspended I/O due to loss of quorum.
A.3. Changes Coming From DRBD 8.4
If you are coming to DRBD 9.x from DRBD 8.4, some noteworthy changes are detailed in the following subsections.
A.3.1. Connections
With DRBD 9, data can be replicated across more than two nodes.
This also means that stacking DRBD volumes is now deprecated (though still possible), and that using DRBD as a network block device (a DRBD client) now makes sense.
Associated with this change are:
-
Metadata size changes (one bitmap per peer).
-
/proc/drbd now only gives minimal information; see drbdadm status.
-
Resynchronization to or from multiple peers is possible.
-
The activity log is used even when in Secondary role.
A.3.2. Auto-Promote Feature
DRBD 9 can be configured to do the Primary/Secondary role switch automatically, on-demand.
This feature replaces both the become-primary-on configuration value, as well as the old Heartbeat v1 drbddisk script.
See Automatic Promotion of Resources for more details.
Appendix B: Upgrading DRBD From 8.4 to 9.x
This section covers the process of upgrading DRBD from version 8.4.x to 9.x in detail. For upgrades within version 9, and for special considerations when upgrading to a particular DRBD 9.x version, refer to the Upgrading DRBD chapter in this guide.
B.1. Compatibility
DRBD 9.a.b releases are generally protocol compatible with DRBD 8.c.d. In particular, all DRBD 9.a.b releases other than DRBD 9.1.0 to 9.1.7 inclusive are compatible with DRBD 8.c.d.
B.2. General Overview
The general process for upgrading 8.4 to 9.x is as follows:
-
Configure the new repositories (if using packages from LINBIT).
-
Verify that the current situation is okay.
-
Pause any cluster manager.
-
Upgrade packages to install new versions.
-
If you want to move to more than two nodes, you will need to resize the lower-level storage to provide room for the additional metadata. This topic is discussed in the LVM Chapter.
-
Unconfigure resources, unload DRBD 8.4, and load the v9 kernel module.
-
Convert DRBD metadata to format v09, perhaps changing the number of bitmaps in the same step.
-
Start the DRBD resources and bring the cluster node online again if you are using a cluster manager.
B.3. Updating Your Repository
Due to the number of changes between the 8.4 and 9.x branches, LINBIT has created separate repositories for each. The best way to get LINBIT’s software installed on your machines, if you have a LINBIT customer or evaluation account, is to download a small Python helper script and run it on your target machines.
B.3.1. Using the LINBIT Manage Node Helper Script to Enable LINBIT Repositories
Running the LINBIT helper script will allow you to enable certain LINBIT package repositories. When upgrading from DRBD 8.4, it is recommended that you enable the drbd-9 package repository.
While the helper script does give you the option of enabling a drbd-9.0 package repository, this is not recommended as a way to upgrade from DRBD 8.4, as that branch only contains DRBD 9.0 and related software. It will likely be discontinued in the future and the DRBD versions 9.1+ that are available in the drbd-9 package repository are protocol compatible with version 8.4.
To use the script to enable the drbd-9 repository, refer to the instructions in this guide for Using a LINBIT Helper Script to Register Nodes and Configure Package Repositories.
B.3.2. Debian/Ubuntu Systems
When using LINBIT package repositories to update DRBD 8.4 to 9.1+, note that LINBIT currently only keeps two LTS Ubuntu versions up-to-date: Focal (20.04) and Jammy (22.04). If you are running DRBD v8.4, you are likely on an older version of Ubuntu Linux than these. Before using the helper script to add LINBIT package repositories to update DRBD, you would first need to update your system to a LINBIT supported LTS version.
B.4. Checking the DRBD State
Before you update DRBD, verify that your resources are in sync. The output of cat /proc/drbd should show an UpToDate/UpToDate status for your resources.
node-2# cat /proc/drbd
version: 8.4.9-1 (api:1/proto:86-101)
GIT-hash: [...] build by linbit@buildsystem, 2016-11-18 14:49:21
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:211852 dw:211852 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
The cat /proc/drbd command is deprecated in DRBD versions 9.x for getting resource status information. After upgrading DRBD, use the drbdadm status command to get resource status information.
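If you upgrade many nodes, the pre-upgrade check can be scripted. The sketch below assumes the DRBD 8.4 /proc/drbd layout shown in the sample output above (one line per device containing cs: and ds: fields). The function name and the regular expression are illustrative and are not part of DRBD’s tooling.

```python
import re

# Matches device lines such as:
#  0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
DEVICE_LINE = re.compile(
    r"^\s*\d+:\s+cs:(\S+)\s+ro:\S+\s+ds:(\S+)/(\S+)", re.MULTILINE
)


def all_devices_in_sync(proc_drbd_text: str) -> bool:
    """True only if at least one device is listed and every device is
    Connected with an UpToDate/UpToDate disk state."""
    devices = DEVICE_LINE.findall(proc_drbd_text)
    return bool(devices) and all(
        cs == "Connected" and local == "UpToDate" and peer == "UpToDate"
        for cs, local, peer in devices
    )


if __name__ == "__main__":
    with open("/proc/drbd") as handle:
        print("in sync" if all_devices_in_sync(handle.read()) else "NOT in sync")
```

Returning False when no device line is found is a deliberate choice in this sketch, so that an unloaded module or an unexpected format is never mistaken for a healthy state.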
B.5. Pausing the Services
Now that you know the resources are in sync, start by upgrading the secondary node. This can be done manually or according to your cluster manager’s documentation. Both processes are covered below. If you are running Pacemaker as your cluster manager, do not use the manual method.
B.6. Upgrading the Packages
Now update your packages.
RHEL/CentOS:
node-2# dnf -y upgrade
Debian/Ubuntu:
node-2# apt-get update && apt-get upgrade
Once the upgrade is finished, you will have the latest DRBD 9.x kernel module and drbd-utils installed on your secondary node, node-2. However, the new kernel module is not active yet.
B.7. Loading the New Kernel Module
By now the DRBD module should not be in use anymore, so unload it by entering the following command:
node-2# rmmod drbd_transport_tcp; rmmod drbd
If there is a message like ERROR: Module drbd is in use, then not all resources have been correctly stopped.
Retry upgrading packages, or run the command drbdadm down all to find out which resources are still active.
Some typical issues that might prevent you from unloading the kernel module are:
-
NFS export on a DRBD-backed filesystem (see exportfs -v output)
-
Filesystem still mounted (check grep drbd /proc/mounts)
-
Loopback device active (losetup -l)
-
Device mapper using DRBD, directly or indirectly (dmsetup ls --tree)
-
LVM with a DRBD-PV (pvs)
This list is not complete. These are just the most common examples.
Now you can load the new DRBD module.
node-2# modprobe drbd
Next, you can verify that the version of the DRBD kernel module that is loaded is the updated 9.x version. If the installed package is for the wrong kernel version, the modprobe would be successful, but output from a drbdadm --version command would show that the DRBD kernel version (DRBD_KERNEL_VERSION_CODE) was still at the older 8.4 (0x08040 in hexadecimal) version.
The output of drbdadm --version should show 9.x.y and look similar to this:
DRBDADM_BUILDTAG=GIT-hash:\ [...]\ build\ by\ @buildsystem\,\ 2022-09-19\ 12:15:10
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x09010b
DRBD_KERNEL_VERSION=9.1.11
DRBDADM_VERSION_CODE=0x091600
DRBDADM_VERSION=9.22.0
On the primary node, node-1, drbdadm --version will still show the 8.4 version information until you upgrade that node as well.
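The version codes in this output can be decoded programmatically. The sketch below assumes that each code packs major, minor, and patch level into one byte each. That layout is inferred from the sample output above (0x09010b next to 9.1.11, and 0x091600 next to 9.22.0) rather than taken from a specification. The function names are illustrative.

```python
def parse_version_output(text: str) -> dict:
    """Split the KEY=VALUE lines printed by `drbdadm --version` into a dict."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            pairs[key.strip()] = value.strip()
    return pairs


def decode_version_code(code: str) -> tuple:
    """Decode a code such as '0x09010b' into (major, minor, patch)."""
    value = int(code, 16)
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


def kernel_module_is_v9(text: str) -> bool:
    """True if the loaded kernel module reports major version 9."""
    code = parse_version_output(text)["DRBD_KERNEL_VERSION_CODE"]
    return decode_version_code(code)[0] == 9


if __name__ == "__main__":
    import subprocess
    output = subprocess.run(
        ["drbdadm", "--version"], capture_output=True, text=True, check=True
    ).stdout
    print(kernel_module_is_v9(output))
```

Checking DRBD_KERNEL_VERSION_CODE rather than DRBDADM_VERSION matters here because the user space tools can already be at 9.x while an 8.4 kernel module is still loaded.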
B.8. Migrating Your Configuration Files
DRBD 9.x is backward compatible with the 8.4 configuration files; however, some syntax has changed. See Changes to the Configuration Syntax for a full list of changes. In the meantime, you can port your old configs fairly easily by using the drbdadm dump all command. This will output a new global configuration followed by the new resource configuration files. Take this output and make changes accordingly.
B.9. Changing the Metadata
Now you need to convert the on-disk metadata to the new version. You can do this by using the drbdadm create-md command and answering two questions.
If you want to change the number of nodes, you should already have increased the size of the lower level device, so that there is enough space to store the additional bitmaps; in that case, you would run the command below with an additional argument --max-peers=<N>. When determining the number of (possible) peers, please take setups like the DRBD Client into account.
# drbdadm create-md <resource>
You want me to create a v09 style flexible-size internal meta data block.
There appears to be a v08 flexible-size internal meta data block
already in place on <disk> at byte offset <offset>
Valid v08 meta-data found, convert to v09?
[need to type 'yes' to confirm] yes
md_offset <offsets...>
al_offset <offsets...>
bm_offset <offsets...>
Found some data
 ==> This might destroy existing data! <==
Do you want to proceed?
[need to type 'yes' to confirm] yes
Writing meta data...
New drbd meta data block successfully created.
success
Of course, you can pass all for the resource names, too. And if you feel lucky, brave, or both, you can avoid the questions by using the --force flag like this:
drbdadm -v --max-peers=<N> -- --force create-md <resources>
The order of these arguments is important. Make sure you understand the potential data loss implications of this command before you enter it.
B.10. Starting DRBD Again
Now, the only thing left to do is to get the DRBD devices up and running again. You can do this by using the drbdadm up all command.
Next, depending on whether you are using a cluster manager or if you keep track of your DRBD resources manually, there are two different ways to bring up your resources. If you are using a cluster manager follow its documentation.
-
Manually
node-2# systemctl start drbd@<resource>.target
-
Pacemaker
# crm node online node-2
This should make DRBD connect to the other node, and the resynchronization process will start.
When the two nodes are UpToDate on all resources again, you can move your applications to the already upgraded node (here node-2), and then follow the same steps on the cluster node still running version 8.4.
ping.
--force flag. It’s assumed that before you use the --force flag, you know what you are doing.
bitmap_parse function to provide the CPU mask parameter functionality. See the Linux kernel documentation for the bitmap_parse function: here.
ping-timeout, or the kernel triggers a connection abort, perhaps as a result of the network link going down.