About me

My name is Rich Jerrido and I am the person behind www.outsidaz.org. I am a geek hailing from the City of Brotherly Love. I started this blog a couple of years back as a dumping place for working knowledge that I wanted to have available online regardless of where I was. Over time it has evolved into a full-fledged blog, complete with RSS feeds, comments, and pictures. When I am not hacking on computers for profit, I hack on them for fun.

Identifying failed drives via udev and mdadm

November 5th, 2009 by RichJ

Many of my systems use software RAID and are packed chock full of drives, so one of my biggest pain points is that when a drive fails, I have only limited information to identify which one it was. Replacing a drive then becomes a guessing game as to which physical drive is the faulty one. And quite frankly, that sucks. So using mdadm and udev, I attempt to solve this problem.

mdadm, when configured with the “MAILADDR” option in /etc/mdadm.conf, will send a notification similar to the following:

This is an automatically generated mail message from mdadm
running on host.example.com

A DegradedArray event had been detected on md device /dev/sdb1

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdb1[3] sdc1[1] sdd1[0]
      204672 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
      [=>...................]  recovery =  6.0% (6400/102336) finish=0.7min speed=2133K/sec
     
unused devices: <none>

While that is good, I wanted something that would give me some physical information about the drive that I need to pull. Knowing that the device is /dev/sdb, I know that it is my second “scsi” drive. However, with the advent of udev, device names are not guaranteed to be persistent, so relying on /dev/sdb to ALWAYS be the same PHYSICAL drive is not a good solution. OK, on to plan B. Since udev is responsible for creating the entries under /dev, it knows about (and keeps a database of :) ) device information. Using the udevadm command, I can query this information:

$ udevadm info --query=all --name=/dev/sdb1
P: /devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/sdb1
N: sdb1
W:130
S: block/8:17
S: disk/by-id/scsi-SATA_ST31500341AS_9VS10VKR-part1
S: disk/by-id/ata-ST31500341AS_9VS10VKR-part1
S: disk/by-path/pci-0000:00:11.0-scsi-1:0:0:0-part1
E: UDEV_LOG=3
E: DEVPATH=/devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/sdb1
E: MAJOR=8
E: MINOR=17
E: DEVTYPE=partition
E: DEVNAME=/dev/sdb1
E: ID_VENDOR=ATA
E: ID_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
E: ID_MODEL=ST31500341AS
E: ID_MODEL_ENC=ST31500341AS\x20\x20\x20\x20
E: ID_REVISION=CC1H
E: ID_TYPE=disk
E: ID_SERIAL=SATA_ST31500341AS_9VS10VKR
E: ID_SERIAL_SHORT=9VS10VKR
E: ID_BUS=scsi
E: ID_ATA_COMPAT=ST31500341AS_9VS10VKR
E: ID_PATH=pci-0000:00:11.0-scsi-1:0:0:0
E: ID_FS_USAGE=raid
E: ID_FS_TYPE=linux_raid_member
E: ID_FS_VERSION=0.90.0
E: ID_FS_UUID=9b5ac893:2e331e1a:1d5ea7ac:1d47a566
E: ID_FS_UUID_ENC=9b5ac893:2e331e1a:1d5ea7ac:1d47a566
E: DEVLINKS=/dev/block/8:17 /dev/disk/by-id/scsi-SATA_ST31500341AS_9VS10VKR-part1 /dev/disk/by-id/ata-ST31500341AS_9VS10VKR-part1 /dev/disk/by-path/pci-0000:00:11.0-scsi-1:0:0:0-part1
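
Since kernel names like /dev/sdb can move around, the S: lines in the output above are the interesting part: the disk/by-id entries embed the model and serial number and so follow the physical drive. As a rough sketch (stable_links is a hypothetical helper name, and the sample input is copied from the output above so the snippet runs without the hardware), they can be pulled out like this:

```shell
#!/bin/bash
# Print the persistent /dev/disk/by-id names found in
# `udevadm info --query=all` output read from stdin.
# stable_links is a hypothetical helper name, not part of udev.
stable_links() {
    # "S:" lines hold symlink names relative to /dev
    sed -n 's|^S: \(disk/by-id/.*\)$|/dev/\1|p'
}

# Sample input copied from the udevadm output shown above
sample='N: sdb1
S: block/8:17
S: disk/by-id/scsi-SATA_ST31500341AS_9VS10VKR-part1
S: disk/by-id/ata-ST31500341AS_9VS10VKR-part1
S: disk/by-path/pci-0000:00:11.0-scsi-1:0:0:0-part1'

printf '%s\n' "$sample" | stable_links
```

On a live system the same function could be fed directly, for example with udevadm info --query=all --name=/dev/sdb1 | stable_links.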

Now that I am equipped with this information, I need to somehow roll it together with mdadm to notify me. From the mdadm.conf man page:

PROGRAM
The program line gives the name of a program to be run when mdadm --monitor detects potentially interesting events on any of the arrays that it is monitoring. This program gets run with two or three arguments, they being the Event, the md device, and possibly the related component device.

Armed with that tidbit, we whip that into a bash script, put it somewhere (/usr/local/sbin works well), and make it executable:

#!/bin/bash

# The event that occurred
MDEVENT=$1

# The md device that is affected
MDDEVICE=$2

# The physical disk that is affected (mdadm does not always pass this)
PHYSDEVICE=$3

# Subject line of the email
SUBJECT="A ${MDEVENT} Event has been detected on ${HOSTNAME}"

# Temporary file for the email body; mktemp avoids a predictable name in /tmp
LOGFILE=$(mktemp /tmp/mdadm_logging.XXXXXX) || exit 1

echo "$SUBJECT" > "$LOGFILE"
echo "******************" >> "$LOGFILE"
echo "Affected Array  -  $MDDEVICE" >> "$LOGFILE"
echo "******************" >> "$LOGFILE"

# Check to see if the physical disk parameter ($3) has been passed. If it is not
# null then mdadm has passed it, and we can check the disk via udev.
if [ -n "$PHYSDEVICE" ]; then
        echo "Affected Physical Drive  -  $PHYSDEVICE" >> "$LOGFILE"
        echo "******************" >> "$LOGFILE"
        echo "Physical Drive Information is as follows" >> "$LOGFILE"
        echo "******************" >> "$LOGFILE"
        udevadm info --query=all --name="$PHYSDEVICE" >> "$LOGFILE"
fi

# Mail the report to root, then clean up the temporary file
mail -s "$SUBJECT" root < "$LOGFILE"
/bin/rm -f "$LOGFILE"

Now all we need to do is edit the PROGRAM line in /etc/mdadm.conf to reflect the new script and restart the mdmonitor daemon. Below is an example on my system:

PROGRAM /usr/local/sbin/raid_notify.sh
ARRAY /dev/md1 level=raid5 num-devices=4 UUID=cbcefefc:3faec7ef:e42a7516:52f160ad
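
Rather than waiting for a real failure, the two-or-three-argument convention from the man page excerpt can be exercised by hand. The following is only a dry-run sketch: notify_dry_run is a hypothetical stand-in that handles its arguments the same way as a PROGRAM handler would, but prints to stdout instead of sending mail.

```shell
#!/bin/bash
# Hypothetical stand-in for a PROGRAM handler: same argument order
# (event, md device, optional component device), but it prints to
# stdout instead of mailing, so it is safe to run anywhere.
notify_dry_run() {
    local event=$1 array=$2 component=${3:-}
    echo "Event: ${event}"
    echo "Array: ${array}"
    if [ -n "$component" ]; then
        echo "Component: ${component}"
    else
        echo "Component: (none passed)"
    fi
}

# Three arguments, as for an event with a related component device
notify_dry_run Fail /dev/md1 /dev/sdb1

# Two arguments, as for an event with no component device
notify_dry_run DegradedArray /dev/md1
```

The real script can be called the same way with made-up arguments, for example /usr/local/sbin/raid_notify.sh Fail /dev/md1 /dev/sdb1. To go through mdadm itself, running mdadm --monitor --scan --oneshot --test as root should generate a TestMessage event for each array it finds and pass it to the configured program.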

And the final product:

A Fail Event has been detected on host.example.com
******************
Affected Array  -  /dev/md1
******************
Affected Physical Drive  -  /dev/sdb1
******************
Physical Drive Information is as follows
******************
P: /devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/sdb1
N: sdb1
W:130
S: block/8:17
S: disk/by-id/scsi-SATA_ST31500341AS_9VS10VKR-part1
S: disk/by-id/ata-ST31500341AS_9VS10VKR-part1
S: disk/by-path/pci-0000:00:11.0-scsi-1:0:0:0-part1
E: UDEV_LOG=3
E: DEVPATH=/devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/sdb1
E: MAJOR=8
E: MINOR=17
E: DEVTYPE=partition
E: DEVNAME=/dev/sdb1
E: ID_VENDOR=ATA
E: ID_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
E: ID_MODEL=ST31500341AS
E: ID_MODEL_ENC=ST31500341AS\x20\x20\x20\x20
E: ID_REVISION=CC1H
E: ID_TYPE=disk
E: ID_SERIAL=SATA_ST31500341AS_9VS10VKR
E: ID_SERIAL_SHORT=9VS10VKR
E: ID_BUS=scsi
E: ID_ATA_COMPAT=ST31500341AS_9VS10VKR
E: ID_PATH=pci-0000:00:11.0-scsi-1:0:0:0
E: ID_FS_USAGE=raid
E: ID_FS_TYPE=linux_raid_member
E: ID_FS_VERSION=0.90.0
E: ID_FS_UUID=9b5ac893:2e331e1a:1d5ea7ac:1d47a566
E: ID_FS_UUID_ENC=9b5ac893:2e331e1a:1d5ea7ac:1d47a566
E: DEVLINKS=/dev/block/8:17 /dev/disk/by-id/scsi-SATA_ST31500341AS_9VS10VKR-part1 /dev/disk/by-id/ata-ST31500341AS_9VS10VKR-part1 /dev/disk/by-path/pci-0000:00:11.0-scsi-1:0:0:0-part1

Granted, we could use grep to trim the udevadm output down to just the fields we care about, but I generally like a little extra verbosity.
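
For anyone who prefers the shorter report, here is a minimal sketch of that grep filter. The name drive_summary is hypothetical, and the three fields kept (model, short serial, and bus path) are just one possible selection; the sample input is copied from the output above.

```shell
#!/bin/bash
# Keep only the udev properties that identify the physical drive.
# drive_summary is a hypothetical helper name; the field list is an
# assumption about what is useful, so adjust it to taste.
drive_summary() {
    grep -E '^E: (ID_MODEL|ID_SERIAL_SHORT|ID_PATH)='
}

# Sample input copied from the udevadm output shown above
sample='E: DEVNAME=/dev/sdb1
E: ID_MODEL=ST31500341AS
E: ID_MODEL_ENC=ST31500341AS\x20\x20\x20\x20
E: ID_SERIAL=SATA_ST31500341AS_9VS10VKR
E: ID_SERIAL_SHORT=9VS10VKR
E: ID_PATH=pci-0000:00:11.0-scsi-1:0:0:0'

printf '%s\n' "$sample" | drive_summary
```

In the notification script, the udevadm line would then be piped through the filter before being appended to the log file, for example udevadm info --query=all --name="$PHYSDEVICE" | drive_summary >> "$LOGFILE".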

Posted under: Linux - CentOS, Linux - Fedora, Linux - Redhat


2 Responses to “Identifying failed drives via udev and mdadm”

  1. swygue says:

    This is nice! Have you ever observed device names changing without the system ever rebooting? I have a RAID 5 config and, facing the same issue you describe here, I use a combination of lsscsi and smartctl to identify each drive by serial number. Today I got an email about the array being degraded and referred to my notes to find the failed device. I realized the device had not failed; only the kernel name had changed from /dev/sdd to /dev/sdj.

  2. RichJ says:

    I’ve never seen device names change like that unless some type of hotplug event occurred.

