Identifying failed drives via udev and mdadm
November 5th, 2009 by RichJ
Given the fact that many of my systems use software RAID and are packed chock full of drives, one of my biggest pain points is that on a drive failure, I have only limited information available to identify which drive failed. So when I go to replace a drive, it is sometimes a guessing game as to which drive is the faulty drive. And quite frankly, that sucks. So using mdadm and udev, I attempt to solve this problem.
mdadm, when configured with the “MAILADDR” option in /etc/mdadm.conf, will send a notification similar to the following:
running on host.example.com
A DegradedArray event had been detected on md device /dev/sdb1
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdb1[3] sdc1[1] sdd1[0]
204672 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
[=>...................] recovery = 6.0% (6400/102336) finish=0.7min speed=2133K/sec
unused devices: <none>
While that is good, I wanted something that would give me some physical information about the drive that I need to pull. Knowing that the device is /dev/sdb, I know that it is my second “scsi” drive. However, with the advent of udev, device names are not guaranteed to be persistent, so relying on /dev/sdb to ALWAYS be the same PHYSICAL drive would not be the best solution. OK, on to plan B. Since udev is responsible for creating the entries under /dev, it knows about (and keeps a database of) device information. Using the udevadm command (udevadm info --query=all --name=/dev/sdb1, the same query the script further down runs), I can query this information:
P: /devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/sdb1
N: sdb1
W:130
S: block/8:17
S: disk/by-id/scsi-SATA_ST31500341AS_9VS10VKR-part1
S: disk/by-id/ata-ST31500341AS_9VS10VKR-part1
S: disk/by-path/pci-0000:00:11.0-scsi-1:0:0:0-part1
E: UDEV_LOG=3
E: DEVPATH=/devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/sdb1
E: MAJOR=8
E: MINOR=17
E: DEVTYPE=partition
E: DEVNAME=/dev/sdb1
E: ID_VENDOR=ATA
E: ID_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
E: ID_MODEL=ST31500341AS
E: ID_MODEL_ENC=ST31500341AS\x20\x20\x20\x20
E: ID_REVISION=CC1H
E: ID_TYPE=disk
E: ID_SERIAL=SATA_ST31500341AS_9VS10VKR
E: ID_SERIAL_SHORT=9VS10VKR
E: ID_BUS=scsi
E: ID_ATA_COMPAT=ST31500341AS_9VS10VKR
E: ID_PATH=pci-0000:00:11.0-scsi-1:0:0:0
E: ID_FS_USAGE=raid
E: ID_FS_TYPE=linux_raid_member
E: ID_FS_VERSION=0.90.0
E: ID_FS_UUID=9b5ac893:2e331e1a:1d5ea7ac:1d47a566
E: ID_FS_UUID_ENC=9b5ac893:2e331e1a:1d5ea7ac:1d47a566
E: DEVLINKS=/dev/block/8:17 /dev/disk/by-id/scsi-SATA_ST31500341AS_9VS10VKR-part1 /dev/disk/by-id/ata-ST31500341AS_9VS10VKR-part1 /dev/disk/by-path/pci-0000:00:11.0-scsi-1:0:0:0-part1
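Since the kernel name can move around, the persistent symlinks that udev maintains (the S: lines above) are often the more useful handle on a drive. Here is a minimal sketch of going from a device node to its symlinks, assuming GNU readlink is available. The function name links_for_device and its optional directory argument are made up for illustration and are not part of udev:

```shell
#!/bin/bash
# Print every symlink in a directory (default /dev/disk/by-id) that
# resolves to the given device node.
# links_for_device is a hypothetical helper name, not a udev tool.
links_for_device() {
    local dev dir link
    dev=$(readlink -f "$1") || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob)
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            echo "$link"
        fi
    done
}
```

Calling links_for_device /dev/sdb1 on the system above should list the two by-id names shown in the udevadm output, which carry the model and serial number printed on the drive label.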
Now that I am equipped with this information, I need to somehow roll it together with mdadm to notify me. From the mdadm.conf man page:
PROGRAM
The program line gives the name of a program to be run when mdadm --monitor detects potentially interesting events on any of the arrays that it is monitoring. This program gets run with two or three arguments, they being the Event, the md device, and possibly the related component device.
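As a rough sketch of how a handler might treat those arguments: the mdadm man page notes that only Fail, FailSpare, DegradedArray, SparesMissing and TestMessage cause mail to be sent, while every event runs the program. The two function names below are hypothetical, and the third argument is treated as optional because mdadm only passes it for some events:

```shell
#!/bin/bash
# Return success (0) if the event is one of those that mdadm also mails
# about; other events (e.g. RebuildStarted, NewArray) return 1.
# is_alert_event is a hypothetical helper name.
is_alert_event() {
    case "$1" in
        Fail|FailSpare|DegradedArray|SparesMissing|TestMessage) return 0 ;;
        *) return 1 ;;
    esac
}

# Describe the arguments mdadm passes: event, md device, and an optional
# component device that is only present for some events.
describe_event() {
    local event=$1 md=$2 component=${3:-none}
    echo "event=$event array=$md component=$component"
}
```

A handler could call is_alert_event "$1" first and exit quietly for routine events such as rebuild progress.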
Armed with that tidbit, we whip that into a bash script, put it somewhere (/usr/local/sbin works well) & make it executable:
#!/bin/bash
#The Event that occurred
MDEVENT=$1
#The md device that is affected
MDDEVICE=$2
#The physical disk that is affected (mdadm does not pass this for every event)
PHYSDEVICE=$3
#Subject line of the email
SUBJECT="A ${MDEVENT} Event has been detected on ${HOSTNAME}"
#Logfile for the email (mktemp avoids a predictable file name in /tmp)
LOGFILE=$(mktemp /tmp/mdadm_logging.XXXXXX) || exit 1
echo "$SUBJECT" > "$LOGFILE"
echo "******************" >> "$LOGFILE"
echo "Affected Array - $MDDEVICE" >> "$LOGFILE"
echo "******************" >> "$LOGFILE"
#Check to see if the physical disk parameter ($3) has been passed. If it is not
#null then mdadm has passed it, and we can check the disk via udev.
if [ -n "$PHYSDEVICE" ]
then
    echo "Affected Physical Drive - $PHYSDEVICE" >> "$LOGFILE"
    echo "******************" >> "$LOGFILE"
    echo "Physical Drive Information is as follows" >> "$LOGFILE"
    echo "******************" >> "$LOGFILE"
    udevadm info --query=all --name="$PHYSDEVICE" >> "$LOGFILE"
fi
#Send the report to root, then clean up the temporary file
mail -s "$SUBJECT" root < "$LOGFILE"
/bin/rm -f "$LOGFILE"
Now all we need to do is edit the PROGRAM line in /etc/mdadm.conf to reflect the new script and restart the mdmonitor daemon. Below is an example on my system:
ARRAY /dev/md1 level=raid5 num-devices=4 UUID=cbcefefc:3faec7ef:e42a7516:52f160ad
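Before restarting the monitor, it can be worth confirming which handler the configuration actually names. A small sketch, assuming the keyword is written out in full as PROGRAM at the start of a line. The function name is hypothetical, and any path you feed it is a placeholder rather than the value from the system above:

```shell
#!/bin/bash
# Print the handler named on the PROGRAM line of an mdadm.conf-style file.
# Returns non-zero if no PROGRAM line is found.
# configured_program is a hypothetical helper name.
configured_program() {
    local conf=${1:-/etc/mdadm.conf}
    awk 'toupper($1) == "PROGRAM" { print $2; found = 1; exit }
         END { exit !found }' "$conf"
}
```

With no argument it reads /etc/mdadm.conf. Once mdmonitor has been restarted, mdadm's monitor mode also documents --oneshot and --test options, which should generate a TestMessage event for each array and so exercise the script without waiting for a real failure.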
And the final product
******************
Affected Array - /dev/md1
******************
Affected Physical Drive - /dev/sdb1
******************
Physical Drive Information is as follows
******************
P: /devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/sdb1
N: sdb1
W:130
S: block/8:17
S: disk/by-id/scsi-SATA_ST31500341AS_9VS10VKR-part1
S: disk/by-id/ata-ST31500341AS_9VS10VKR-part1
S: disk/by-path/pci-0000:00:11.0-scsi-1:0:0:0-part1
E: UDEV_LOG=3
E: DEVPATH=/devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/sdb1
E: MAJOR=8
E: MINOR=17
E: DEVTYPE=partition
E: DEVNAME=/dev/sdb1
E: ID_VENDOR=ATA
E: ID_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
E: ID_MODEL=ST31500341AS
E: ID_MODEL_ENC=ST31500341AS\x20\x20\x20\x20
E: ID_REVISION=CC1H
E: ID_TYPE=disk
E: ID_SERIAL=SATA_ST31500341AS_9VS10VKR
E: ID_SERIAL_SHORT=9VS10VKR
E: ID_BUS=scsi
E: ID_ATA_COMPAT=ST31500341AS_9VS10VKR
E: ID_PATH=pci-0000:00:11.0-scsi-1:0:0:0
E: ID_FS_USAGE=raid
E: ID_FS_TYPE=linux_raid_member
E: ID_FS_VERSION=0.90.0
E: ID_FS_UUID=9b5ac893:2e331e1a:1d5ea7ac:1d47a566
E: ID_FS_UUID_ENC=9b5ac893:2e331e1a:1d5ea7ac:1d47a566
E: DEVLINKS=/dev/block/8:17 /dev/disk/by-id/scsi-SATA_ST31500341AS_9VS10VKR-part1 /dev/disk/by-id/ata-ST31500341AS_9VS10VKR-part1 /dev/disk/by-path/pci-0000:00:11.0-scsi-1:0:0:0-part1
Granted, we could use grep to trim the udevadm output down to just the fields we care about, but I generally like a little extra verbosity.
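For anyone who does want the shorter mail, here is one possible filter. It is only a sketch: the function name trim_udev_info is made up, and the list of fields is just one reasonable choice for locating a drive physically:

```shell
#!/bin/bash
# Keep only the udev properties that help locate a drive physically.
# Reads `udevadm info --query=all` output on stdin.
# trim_udev_info is a hypothetical helper name; the field list is one choice.
trim_udev_info() {
    grep -E '^E: (DEVNAME|ID_MODEL|ID_SERIAL_SHORT|ID_PATH)='
}
```

In the script, the udevadm line would then become udevadm info --query=all --name="$PHYSDEVICE" | trim_udev_info >> "$LOGFILE".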
My name is Rich Jerrido and I am the person behind




This is nice! Have you ever observed device names changing without the system ever rebooting? I have a raid5 config and, facing the same issue you describe here, I use a combination of lsscsi and smartctl to identify each drive by serial number. Today I got an email about the array being degraded and referred to my notes to find the failed device. I realized the device had not failed; only the kernel name had changed from /dev/sdd to /dev/sdj.
I’ve never seen device names change like that unless some type of hotplug event occurred.