|
|
[DRBD](http://drbd.org/) is basically "RAID over the network", the ability to
|
|
|
replicate block devices over multiple machines. It's used extensively
|
|
|
in our [howto/ganeti](howto/ganeti) configuration to replicate virtual machines across
|
|
|
multiple hosts.
|
|
|
|
|
|
[[_TOC_]]
|
|
|
|
|
|
How-to
|
|
|
======
|
|
|
|
|
|
Checking status
|
|
|
---------------
|
|
|
|
|
|
Just like `mdadm`, there's a device in `/proc` which shows the status
|
|
|
of the RAID configuration. This is a healthy configuration:
|
|
|
|
|
|
# cat /proc/drbd
|
|
|
version: 8.4.10 (api:1/proto:86-101)
|
|
|
srcversion: 9B4D87C5E865DF526864868
|
|
|
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
|
|
|
ns:0 nr:10821208 dw:10821208 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
|
|
|
1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
|
|
|
ns:0 nr:10485760 dw:10485760 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
|
|
|
2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
|
|
|
ns:0 nr:1048580 dw:1048580 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
|
|
|
|
|
|
Keyword: `UpToDate`. This is a configuration that is being resync'd:
|
|
|
|
|
|
version: 8.4.10 (api:1/proto:86-101)
|
|
|
srcversion: 9B4D87C5E865DF526864868
|
|
|
0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
|
|
|
ns:0 nr:9352840 dw:9352840 dr:0 al:8 bm:0 lo:1 pe:3 ua:0 ap:0 ep:1 wo:f oos:1468352
|
|
|
[================>...] sync'ed: 86.1% (1432/10240)M
|
|
|
finish: 0:00:36 speed: 40,436 (38,368) want: 61,440 K/sec
|
|
|
1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
|
|
|
ns:0 nr:8439808 dw:8439808 dr:0 al:8 bm:0 lo:1 pe:3 ua:0 ap:0 ep:1 wo:f oos:2045952
|
|
|
[===============>....] sync'ed: 80.6% (1996/10240)M
|
|
|
finish: 0:00:52 speed: 39,056 (37,508) want: 61,440 K/sec
|
|
|
2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
|
|
|
ns:0 nr:1048580 dw:1048580 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
|
|
|
|
|
|
See [the upstream documentation](https://docs.linbit.com/docs/users-guide-8.3/p-work/) for details on this output.
|
|
|
|
|
|
The [drbdmon](http://manpages.debian.org/drbdmon) command also provides a similar view but, in my
|
|
|
opinion, less readable.
|
|
|
|
|
|
Because DRBD is built with kernel modules, you can also see activity
|
|
|
in the `dmesg` logs
|
|
|
|
|
|
## Finding device associated with host
|
|
|
|
|
|
In the drbd status, devices are shown by their `minor` identifier. For
|
|
|
example, this is device minor id 18 having a trouble of some sort:
|
|
|
|
|
|
18: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
|
|
|
ns:1237956 nr:0 dw:11489220 dr:341910 al:177 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
|
|
|
[===================>] sync'ed:100.0% (0/10240)M
|
|
|
finish: 0:00:00 speed: 764 (768) K/sec (stalled)
|
|
|
|
|
|
Finding which host is associated with this device is easy: just call
|
|
|
`list-drbd`:
|
|
|
|
|
|
root@fsn-node-01:~# gnt-node list-drbd fsn-node-01 | grep 18
|
|
|
fsn-node-01.torproject.org 18 gettor-01.torproject.org disk/0 primary fsn-node-02.torproject.org
|
|
|
|
|
|
It's the host `gettor-01`.
|
|
|
|
|
|
## Pager playbook
|
|
|
|
|
|
### Resyncing disks
|
|
|
|
|
|
In Nagios, if you see this warning:
|
|
|
|
|
|
DRBD CRITICAL: Device 10 WFConnection UpToDate, Device 9 WFConnection UpToDate
|
|
|
|
|
|
It means that, on that host (in my case it was
|
|
|
`fsn-node-04.torproject.org`), disks are desynchronized for some
|
|
|
reason. In this case, those are disks 9 and 10. You can confirm that
|
|
|
on the host:
|
|
|
|
|
|
# ssh fsn-node-04.torproject.org cat /proc/drbd
|
|
|
[...]
|
|
|
9: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
|
|
|
ns:13799284 nr:0 dw:272704248 dr:15512933 al:1331 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:8343096
|
|
|
10: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
|
|
|
ns:2097152 nr:0 dw:2097192 dr:2102652 al:9 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:40
|
|
|
[...]
|
|
|
|
|
|
You need to find which instance this disk is associated with (see also
|
|
|
above):
|
|
|
|
|
|
$ ssh fsn-node-01.torproject.org gnt-node list-drbd fsn-node-04
|
|
|
[...]
|
|
|
Node Minor Instance Disk Role PeerNode
|
|
|
[...]
|
|
|
fsn-node-04.torproject.org 9 onionoo-frontend-01.torproject.org disk/0 primary fsn-node-03.torproject.org
|
|
|
fsn-node-04.torproject.org 10 onionoo-frontend-01.torproject.org disk/1 primary fsn-node-03.torproject.org
|
|
|
[...]
|
|
|
|
|
|
Then you can "reactivate" the disks simply by telling ganeti:
|
|
|
|
|
|
$ ssh fsn-node-01.torproject.org gnt-instance activate-disks onionoo-frontend-01.torproject.org
|
|
|
|
|
|
And then the disk will resync.
|
|
|
|
|
|
## Upstream documentation
|
|
|
|
|
|
* [User guide](https://docs.linbit.com/docs/users-guide-8.3/)
|
|
|
* [upstream intro](https://docs.linbit.com/docs/users-guide-8.3/p-intro/)
|
|
|
* [troubleshooting](https://docs.linbit.com/docs/users-guide-8.3/p-work/#ch-troubleshooting)
|
|
|
|
|
|
Reference
|
|
|
=========
|
|
|
|
|
|
Installation
|
|
|
------------
|
|
|
|
|
|
The `ganeti` Puppet module takes care of basic DRBD configuration, by
|
|
|
installing the right software (`drbd-utils`) and kernel
|
|
|
modules. Everything else is handled automatically by Ganeti itself.
|
|
|
|
|
|
There's a Nagios check for the DRBD service that ensures devices are
|
|
|
synchronized. It will yield an `UNKNOWN` status when no device is
|
|
|
created, so it's expected that new nodes are flagged until they host
|
|
|
some content. The check is shipped as part of `tor-nagios-checks`, as
|
|
|
`dsa-check-drbd`, see [dsa-check-drbd](https://gitweb.torproject.org/admin/tor-nagios.git/plain/tor-nagios-checks/checks/dsa-check-drbd). |