[DRBD](http://drbd.org/) is basically "RAID over the network": it replicates block devices across multiple machines. It's used extensively in our [howto/ganeti](howto/ganeti) configuration to replicate virtual machines across multiple hosts.

[[_TOC_]]

# How-to

## Checking status

Just like `mdadm`, there's a file in `/proc` which shows the status of the RAID configuration. This is a healthy configuration:

    # cat /proc/drbd
    version: 8.4.10 (api:1/proto:86-101)
    srcversion: 9B4D87C5E865DF526864868
     0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
        ns:0 nr:10821208 dw:10821208 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
     1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
        ns:0 nr:10485760 dw:10485760 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
     2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
        ns:0 nr:1048580 dw:1048580 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

Keyword: `UpToDate`. This is a configuration that is being resync'd:

    version: 8.4.10 (api:1/proto:86-101)
    srcversion: 9B4D87C5E865DF526864868
     0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
        ns:0 nr:9352840 dw:9352840 dr:0 al:8 bm:0 lo:1 pe:3 ua:0 ap:0 ep:1 wo:f oos:1468352
            [================>...] sync'ed: 86.1% (1432/10240)M
            finish: 0:00:36 speed: 40,436 (38,368) want: 61,440 K/sec
     1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
        ns:0 nr:8439808 dw:8439808 dr:0 al:8 bm:0 lo:1 pe:3 ua:0 ap:0 ep:1 wo:f oos:2045952
            [===============>....] sync'ed: 80.6% (1996/10240)M
            finish: 0:00:52 speed: 39,056 (37,508) want: 61,440 K/sec
     2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
        ns:0 nr:1048580 dw:1048580 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

See [the upstream documentation](https://docs.linbit.com/docs/users-guide-8.3/p-work/) for details on this output.

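When a node carries many devices, it can help to reduce `/proc/drbd` to the fields that matter. The following is a minimal sketch, not something shipped with DRBD or Ganeti: `drbd_summary` is a hypothetical helper name, and it assumes the DRBD 8.4 `/proc/drbd` layout shown above (device lines starting with `<minor>:` followed by `cs:`, `ro:` and `ds:` fields).

```shell
# Hypothetical helper: print "minor connection-state roles disk-states",
# one line per DRBD device, from /proc/drbd-style text on stdin.
# Assumes the DRBD 8.4 layout; other versions may format this differently.
# Usage: drbd_summary < /proc/drbd
drbd_summary() {
    awk '
        # Device lines look like: " 0: cs:Connected ro:... ds:... C r-----"
        $1 ~ /^[0-9]+:$/ && $2 ~ /^cs:/ {
            minor = $1
            sub(/:$/, "", minor)
            cs = "?"; ro = "?"; ds = "?"
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/)      cs = substr($i, 4)
                else if ($i ~ /^ro:/) ro = substr($i, 4)
                else if ($i ~ /^ds:/) ds = substr($i, 4)
            }
            print minor, cs, ro, ds
        }
    '
}
```

For the resyncing example above, this would print `0 SyncTarget Secondary/Primary Inconsistent/UpToDate` for the first device, which makes it easy to `grep -v` the healthy `Connected` and `UpToDate/UpToDate` entries.
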
The [drbdmon](http://manpages.debian.org/drbdmon) command provides a similar view to `/proc/drbd` but, in my opinion, is less readable.

Because DRBD is implemented as kernel modules, you can also see its activity in the `dmesg` logs.

## Finding device associated with host

In the DRBD status, devices are shown by their `minor` identifier. For example, this is the device with minor id 18 having some sort of trouble:

    18: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
        ns:1237956 nr:0 dw:11489220 dr:341910 al:177 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
            [===================>] sync'ed:100.0% (0/10240)M
            finish: 0:00:00 speed: 764 (768) K/sec (stalled)

Finding which host is associated with this device is easy: just call `list-drbd`:

    root@fsn-node-01:~# gnt-node list-drbd fsn-node-01 | grep 18
    fsn-node-01.torproject.org 18 gettor-01.torproject.org disk/0 primary fsn-node-02.torproject.org

It's the host `gettor-01`.

## Deleting a stray device

If Ganeti tried to create a device on one node but couldn't reach the other node (for example if the secondary IP on the other node wasn't set correctly), you will see this error in Ganeti:

    - ERROR: node chi-node-03.torproject.org: unallocated drbd minor 0 is in use

You can confirm this by looking at `/proc/drbd` on that node:

    root@chi-node-03:~# cat /proc/drbd
    version: 8.4.10 (api:1/proto:86-101)
    srcversion: 473968AD625BA317874A57E
     0: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown r-----
        ns:0 nr:0 dw:0 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:10485504

And confirm the device does not exist on the other side:

    root@chi-node-04:~# cat /proc/drbd
    version: 8.4.10 (api:1/proto:86-101)
    srcversion: 473968AD625BA317874A57E

The device can therefore be deleted on the `chi-node-03` side.
First detach it:

    drbdsetup detach /dev/drbd0

Then delete it:

    drbdsetup del-minor 0

## Pager playbook

### Resyncing disks

In Nagios, if you see this alert:

    DRBD CRITICAL: Device 10 WFConnection UpToDate, Device 9 WFConnection UpToDate

It means that, on that host (in my case it was `fsn-node-04.torproject.org`), some disks are no longer in sync with their peer: `WFConnection` means the node is waiting for its peer to connect. In this case, those are disks 9 and 10. You can confirm that on the host:

    # ssh fsn-node-04.torproject.org cat /proc/drbd
    [...]
     9: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
        ns:13799284 nr:0 dw:272704248 dr:15512933 al:1331 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:8343096
    10: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
        ns:2097152 nr:0 dw:2097192 dr:2102652 al:9 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:40
    [...]

You need to find which instance these disks are associated with (see also "Finding device associated with host" above):

    $ ssh fsn-node-01.torproject.org gnt-node list-drbd fsn-node-04
    [...]
    Node                       Minor Instance                           Disk   Role    PeerNode
    [...]
    fsn-node-04.torproject.org     9 onionoo-frontend-01.torproject.org disk/0 primary fsn-node-03.torproject.org
    fsn-node-04.torproject.org    10 onionoo-frontend-01.torproject.org disk/1 primary fsn-node-03.torproject.org
    [...]

Then you can "reactivate" the disks simply by telling Ganeti:

    $ ssh fsn-node-01.torproject.org gnt-instance activate-disks onionoo-frontend-01.torproject.org

The disks will then resync.

## Upstream documentation

* [User guide](https://docs.linbit.com/docs/users-guide-8.3/)
* [upstream intro](https://docs.linbit.com/docs/users-guide-8.3/p-intro/)
* [troubleshooting](https://docs.linbit.com/docs/users-guide-8.3/p-work/#ch-troubleshooting)

# Reference

## Installation

The `ganeti` Puppet module takes care of basic DRBD configuration, by installing the right software (`drbd-utils`) and kernel modules. Everything else is handled automatically by Ganeti itself.

There's a Nagios check for the DRBD service that ensures devices are synchronized.
It will yield an `UNKNOWN` status when no device exists yet, so it's expected that new nodes are flagged until they host some content. The check is shipped as part of `tor-nagios-checks`, as `dsa-check-drbd`; see [dsa-check-drbd](https://gitweb.torproject.org/admin/tor-nagios.git/plain/tor-nagios-checks/checks/dsa-check-drbd).
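
To illustrate the kind of logic such a check applies, here is a rough sketch. It is not the actual `dsa-check-drbd` code: `check_drbd_sketch` is a hypothetical name, the message format is only loosely modeled on the alert shown in the pager playbook, and treating every device that is not `Connected` with `UpToDate/UpToDate` disks as critical is an assumption. The return codes follow the standard Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical sketch, NOT the real dsa-check-drbd.
# Reads /proc/drbd-style text (DRBD 8.4 layout) on stdin, prints a
# Nagios-style status line and returns the matching plugin exit code.
# Usage: check_drbd_sketch < /proc/drbd
check_drbd_sketch() {
    awk '
        # Device lines look like: " 9: cs:WFConnection ro:... ds:... C r-----"
        $1 ~ /^[0-9]+:$/ && $2 ~ /^cs:/ {
            devices++
            minor = $1
            sub(/:$/, "", minor)
            cs = ""; ds = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/)      cs = substr($i, 4)
                else if ($i ~ /^ds:/) ds = substr($i, 4)
            }
            # Assumption: anything other than a connected, fully
            # synchronized device is reported as a problem.
            if (cs != "Connected" || ds != "UpToDate/UpToDate") {
                msg = "Device " minor " " cs " " ds
                bad = (bad == "") ? msg : bad ", " msg
            }
        }
        END {
            if (devices == 0) { print "DRBD UNKNOWN: no devices found"; exit 3 }
            if (bad != "")    { print "DRBD CRITICAL: " bad; exit 2 }
            print "DRBD OK: " devices " device(s) connected and up to date"
            exit 0
        }
    '
}
```

On a freshly installed node with no DRBD devices, this sketch returns 3 (`UNKNOWN`), which matches the behavior described above for new nodes.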