[DRBD](http://drbd.org/) is basically "RAID over the network": the ability to
replicate block devices across multiple machines. It's used extensively
in our [howto/ganeti](howto/ganeti) configuration to replicate virtual machines across
multiple hosts.

[[_TOC_]]

# How-to

## Checking status

Just like with `mdadm`, there's a file in `/proc` (`/proc/drbd`) which
shows the status of the RAID configuration. This is a healthy
configuration:

    # cat /proc/drbd
    version: 8.4.10 (api:1/proto:86-101)
    srcversion: 9B4D87C5E865DF526864868 
     0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
        ns:0 nr:10821208 dw:10821208 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
     1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
        ns:0 nr:10485760 dw:10485760 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
     2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
        ns:0 nr:1048580 dw:1048580 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

The keyword to look for is `UpToDate`. This is a configuration that is
being resynchronized (note `cs:SyncTarget` and `ds:Inconsistent/UpToDate`):

    version: 8.4.10 (api:1/proto:86-101)
    srcversion: 9B4D87C5E865DF526864868 
     0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
        ns:0 nr:9352840 dw:9352840 dr:0 al:8 bm:0 lo:1 pe:3 ua:0 ap:0 ep:1 wo:f oos:1468352
    	[================>...] sync'ed: 86.1% (1432/10240)M
    	finish: 0:00:36 speed: 40,436 (38,368) want: 61,440 K/sec
     1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
        ns:0 nr:8439808 dw:8439808 dr:0 al:8 bm:0 lo:1 pe:3 ua:0 ap:0 ep:1 wo:f oos:2045952
    	[===============>....] sync'ed: 80.6% (1996/10240)M
    	finish: 0:00:52 speed: 39,056 (37,508) want: 61,440 K/sec
     2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
        ns:0 nr:1048580 dw:1048580 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

See [the upstream documentation](https://docs.linbit.com/docs/users-guide-8.3/p-work/) for details on this output.
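
To spot unhealthy devices without reading the whole output, the check
can be sketched as a small shell filter. This is only an illustration:
the function name `drbd_unhealthy` is made up here, and it assumes the
`/proc/drbd` layout shown above (DRBD 8.4).

```shell
# Print every DRBD minor that is not fully healthy, i.e. whose connection
# state is not "Connected" or whose disk states are not "UpToDate/UpToDate".
# Reads /proc/drbd-formatted text on stdin.
# NOTE: drbd_unhealthy is a hypothetical helper, not part of drbd-utils.
drbd_unhealthy() {
    awk '
        # Device lines look like " 0: cs:Connected ro:... ds:UpToDate/UpToDate C r-----"
        $1 ~ /^[0-9]+:$/ {
            minor = $1; sub(/:$/, "", minor)
            cs = ""; ds = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)
                if ($i ~ /^ds:/) ds = substr($i, 4)
            }
            if (cs != "Connected" || ds != "UpToDate/UpToDate")
                print minor, cs, ds
        }
    '
}

# Usage on a node:
#   drbd_unhealthy < /proc/drbd
```

An empty result means every configured device is connected and up to
date. Each output line gives the minor, its connection state and its
disk states.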

The [drbdmon](http://manpages.debian.org/drbdmon) command provides a similar view, although it is, in
my opinion, less readable.

Because DRBD is implemented as a kernel module, you can also see its
activity in the `dmesg` logs.

## Finding device associated with host

In the DRBD status, devices are shown by their `minor` identifier. For
example, this is the device with minor id 18 having trouble of some sort:

    18: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
        ns:1237956 nr:0 dw:11489220 dr:341910 al:177 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
    	[===================>] sync'ed:100.0% (0/10240)M
    	finish: 0:00:00 speed: 764 (768) K/sec (stalled)

Finding which host (Ganeti instance) is associated with this device is
easy: just call `gnt-node list-drbd` on the node and look for the minor id:

    root@fsn-node-01:~# gnt-node list-drbd fsn-node-01 | grep 18
    fsn-node-01.torproject.org    18 gettor-01.torproject.org          disk/0 primary   fsn-node-02.torproject.org

It's the host `gettor-01`, listed in the instance column.
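
Note that a plain `grep 18` also matches any line that contains `18`
anywhere, for example minor 118. A stricter match on the minor column
could look like the following sketch. The function name
`drbd_minor_rows` is made up for illustration, and it assumes the minor
is the second whitespace-separated column, as in the output above.

```shell
# Print only the list-drbd rows whose Minor column (2nd field) equals $1.
# Reads `gnt-node list-drbd` output on stdin.
# NOTE: drbd_minor_rows is a hypothetical helper, not a Ganeti command.
drbd_minor_rows() {
    awk -v minor="$1" '$2 == minor'
}

# Usage on the Ganeti master (node name and minor taken from the example above):
#   gnt-node list-drbd fsn-node-01 | drbd_minor_rows 18
```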

## Deleting a stray device

If Ganeti tried to create a device on one node but couldn't reach the
other node (for example if the secondary IP on the other node wasn't
set correctly), you will see this error in Ganeti:

       - ERROR: node chi-node-03.torproject.org: unallocated drbd minor 0 is in use

You can confirm this by looking at `/proc/drbd` on that node:

    root@chi-node-03:~# cat /proc/drbd 
    version: 8.4.10 (api:1/proto:86-101)
    srcversion: 473968AD625BA317874A57E 
     0: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown   r-----
        ns:0 nr:0 dw:0 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:10485504

And confirm the device does not exist on the other side:

    root@chi-node-04:~# cat /proc/drbd 
    version: 8.4.10 (api:1/proto:86-101)
    srcversion: 473968AD625BA317874A57E

The device can therefore be deleted on the `chi-node-03` side. First
detach it:

    drbdsetup detach /dev/drbd0

Then delete it:

    drbdsetup del-minor 0
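
When several minors are affected, the same two commands can be
generated rather than typed by hand. The sketch below only prints the
commands, it does not run them. The function name
`drbd_stray_cleanup_cmds` is hypothetical, and the selection rule
(`cs:StandAlone` with `ro:Secondary/Unknown`) is an assumption based on
the example above. Still confirm on the peer node that the device really
does not exist before running anything.

```shell
# For each minor in StandAlone state with an unknown peer, print (without
# running) the cleanup commands from this section.
# Reads /proc/drbd-formatted text on stdin.
# NOTE: hypothetical helper; the matching rule is an assumption, review
# the printed commands before executing them.
drbd_stray_cleanup_cmds() {
    awk '
        $1 ~ /^[0-9]+:$/ && /cs:StandAlone/ && /ro:Secondary\/Unknown/ {
            minor = $1; sub(/:$/, "", minor)
            print "drbdsetup detach /dev/drbd" minor
            print "drbdsetup del-minor " minor
        }
    '
}

# Usage on the node holding the stray device:
#   drbd_stray_cleanup_cmds < /proc/drbd
```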

## Pager playbook

### Resyncing disks

You might see this alert in Nagios:

    DRBD CRITICAL: Device 10 WFConnection UpToDate, Device 9 WFConnection UpToDate

It means that, on that host (in my case it was
`fsn-node-04.torproject.org`), the listed devices are waiting for a
connection to their peer (`WFConnection`) and are desynchronized for
some reason. In this case, those are devices 9 and 10. You can confirm
that on the host:

    # ssh fsn-node-04.torproject.org cat /proc/drbd
    [...]
     9: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
        ns:13799284 nr:0 dw:272704248 dr:15512933 al:1331 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:8343096
    10: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
        ns:2097152 nr:0 dw:2097192 dr:2102652 al:9 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:40
    [...]

You need to find which instance this disk is associated with (see also
above):

    $ ssh fsn-node-01.torproject.org gnt-node list-drbd fsn-node-04
    [...]
    Node                       Minor Instance                            Disk   Role      PeerNode
    [...]
    fsn-node-04.torproject.org     9 onionoo-frontend-01.torproject.org  disk/0 primary   fsn-node-03.torproject.org
    fsn-node-04.torproject.org    10 onionoo-frontend-01.torproject.org  disk/1 primary   fsn-node-03.torproject.org
    [...]

Then you can "reactivate" the disks simply by telling Ganeti:

    $ ssh fsn-node-01.torproject.org gnt-instance activate-disks onionoo-frontend-01.torproject.org

The disks will then resync.
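
To follow the resync, you can watch the `oos` (out of sync) counter of
each device in `/proc/drbd` drop back to zero. A minimal sketch,
assuming the `/proc/drbd` layout shown above; the function name
`drbd_oos` is made up for illustration:

```shell
# Print "<minor> <oos>" for every device, where oos is the amount of
# out-of-sync data reported by DRBD. Reads /proc/drbd-formatted text on stdin.
# NOTE: drbd_oos is a hypothetical helper, not part of drbd-utils.
drbd_oos() {
    awk '
        # Remember the minor from the device line (" 9: cs:...")
        $1 ~ /^[0-9]+:$/ { minor = $1; sub(/:$/, "", minor) }
        # The counters line that follows carries the oos:<n> field
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^oos:/) print minor, substr($i, 5)
        }
    '
}

# Usage, with the node from the example above:
#   ssh fsn-node-04.torproject.org cat /proc/drbd | drbd_oos
```

Once every device reports `0`, the Nagios check should recover.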

## Upstream documentation

 * [User guide](https://docs.linbit.com/docs/users-guide-8.3/)
 * [Upstream intro](https://docs.linbit.com/docs/users-guide-8.3/p-intro/)
 * [Troubleshooting](https://docs.linbit.com/docs/users-guide-8.3/p-work/#ch-troubleshooting)

# Reference

## Installation

The `ganeti` Puppet module takes care of the basic DRBD configuration by
installing the right software (`drbd-utils`) and kernel
modules. Everything else is handled automatically by Ganeti itself.

There's a Nagios check for the DRBD service that ensures devices are
synchronized. It will yield an `UNKNOWN` status when no device has been
created, so it's expected that new nodes are flagged until they host
some content. The check is shipped as part of `tor-nagios-checks`, as
[dsa-check-drbd](https://gitweb.torproject.org/admin/tor-nagios.git/plain/tor-nagios-checks/checks/dsa-check-drbd).