Skip to content
Snippets Groups Projects
raid.mdwn 4.66 KiB
Newer Older
  • Learn to ignore specific revisions
  • [[!toc levels=3]]
    
    Software RAID
    =============
    
    Replacing a drive
    -----------------
    
    If a drive fails in a server, the procedure is essentially to open a
    ticket, wait for the drive change, partition and re-add it to the RAID
    array. The following procdure assumes that `sda` failed and `sdb` is
    good in a RAID-1 array, but can vary with other RAID configurations or
    drive models.
    
     1. file a ticket upstream
    
        [Hetzner Support](https://robot.your-server.de/support/), for example, has an excellent service which
        asks you the disk serial number (available in the SMART email
        notification) and the SMART log (output of `smartctl -x
        /dev/sda`). Then they will turn off the machine, replace the disk,
        and start it up again.
    
     2. wait for the server to return with the new disk
    
        Hetzner will send an email to the tpa alias when that is done.
    
     3. partition the new drive (`sda`) to match the old (`sdb`):
    
            sfdisk -d /dev/sdb | sfdisk --no-reread /dev/sda --force
    
     4. re-add the new disk to the RAID array:
    
            mdadm /dev/md0 -a /dev/sda
    
    Note that Hetzner also has [pretty good documentation on how to deal
    with SMART output](https://wiki.hetzner.de/index.php/Seriennummern_von_Festplatten_und_Hinweise_zu_defekten_Festplatten/en).
    
    
    Hardware RAID
    =============
    
    Some TPO machines have hardware RAID with `megaraid`
    controllers. Those are controlled with the `MegaCLI` command that is
    ... rather hard to use.
    
    First, alias the megacli command because the package (derived from the
    upstream RPM by Alien) installs it in a strange location:
    
        alias megacli=/opt/MegaRAID/MegaCli/MegaCli
    
    This will confirm you are using hardware raid:
    
        root@moly:/home/anarcat# lspci | grep -i raid
        05:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
    
    This will show the RAID levels of each enclosure, for example this is
    RAID-10:
    
        root@moly:/home/anarcat# megacli -LdPdInfo -aALL | grep "RAID Level"
        RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
    
    This lists a summary of all the disks, for example the first disk has
    failed here:
    
        root@moly:/home/anarcat# megacli -PDList -aALL | grep -e '^Enclosure' -e '^Slot' -e '^PD' -e '^Firmware' -e '^Raw' -e '^Inquiry'
        Enclosure Device ID: 252
        Slot Number: 0
        Enclosure position: 0
        PD Type: SAS
        Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
        Firmware state: Failed
        Inquiry Data: SEAGATE ST3600057SS     [REDACTED]
        Enclosure Device ID: 252
        Slot Number: 1
        Enclosure position: 0
        PD Type: SAS
        Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
        Firmware state: Online, Spun Up
        Inquiry Data: SEAGATE ST3600057SS     [REDACTED]
        Enclosure Device ID: 252
        Slot Number: 2
        Enclosure position: 0
        PD Type: SAS
        Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
        Firmware state: Online, Spun Up
        Inquiry Data: SEAGATE ST3600057SS     [REDACTED]
        Enclosure Device ID: 252
        Slot Number: 3
        Enclosure position: 0
        PD Type: SAS
        Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
        Firmware state: Online, Spun Up
        Inquiry Data: SEAGATE ST3600057SS     [REDACTED]
    
    This will make the drive blink (slot number 0 in enclosure 252):
    
        megacli -PdLocate -start -physdrv[252:0] -aALL
    
    
    SMART monitoring
    ----------------
    
    Some servers will fail to properly detect disk drives in their SMART
    configuration. In particular, `smartd` does not support:
    
     * virtual disks (e.g. `/dev/nbd0`)
     * MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
       devices)
     * out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)
    
    The latter can be configured with the following snippet in
    `/etc/smartd.conf`:
    
        #DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
        DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
        /dev/cciss/c0d0 -d cciss,0
        /dev/cciss/c0d0 -d cciss,1
        /dev/cciss/c0d0 -d cciss,2
        /dev/cciss/c0d0 -d cciss,3
        /dev/cciss/c0d0 -d cciss,4
        /dev/cciss/c0d0 -d cciss,5
    
    Notice how the `DEVICESCAN` is commented out to be replaced by the
    CCISS configuration. One line for each drive should be added (and no,
    it does not autodetect all drives unfortunately). This hack was
    deployed on `listera` which uses that hardware RAID.
    
    Other hardware RAID controllers are better supported. For example, the
    `megaraid` controller on `moly` was correctly detected by `smartd`
    which accurately found a broken hard drive.
    
    
    References
    ----------
    
    Here are some external documentation links:
    
     * <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
     * <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
    
    anarcat's avatar
    anarcat committed
     * <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>