TLDR: Trying to add a blank partition to a degraded RAID1 with mdadm 3.3.2 (Debian Jessie) fails, saying the (perfectly working!) array has "failed" and "--add cannot work". Is it telling me of a real danger here, or have I just hit some weird bug?
Detailed Version
Overnight, I had a disk die. There are 5 mdraid arrays on the box; one of them (a RAID10) rebuilt as expected using a spare. A RAID6 remains degraded until a replacement disk arrives tonight, as does a 5-disk mirror for /boot. There are two RAID1 arrays used for swap, and they share a hot spare. The hot spare was attached to the one which didn't experience a failure, but since they're both in the same spare group, mdadm --monitor attempted to move the spare. That move failed: it didn't give an error so far as I can tell, it just lost the spare.
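(For context, spare sharing between the two swap mirrors is set up with the spare-group keyword in /etc/mdadm/mdadm.conf. The lines below are illustrative rather than my actual config; the second device name and UUID are placeholders, and only the first UUID is real, taken from the array shown later. mdadm --monitor is supposed to migrate a spare between any ARRAY lines that share the same spare-group value.)
ARRAY /dev/md124 metadata=1.2 UUID=3d7da9d2:5ea17db5:3b122196:11968e91 spare-group=swap
ARRAY /dev/md125 metadata=1.2 UUID=<other-swap-mirror-uuid> spare-group=swap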
This morning, the degraded mirror looks like:
md124 : active raid1 sda2[0](F) sdc2[2]
9767448 blocks super 1.2 [2/1] [_U]
bitmap: 0/150 pages [0KB], 32KB chunk
I tried to manually add the spare, and got:
# mdadm -a /dev/md124 /dev/sdj2
mdadm: /dev/md124 has failed so using --add cannot work and might destroy
mdadm: data on /dev/sdj2. You should stop the array and re-assemble it.
/dev/sdj2 had the superblock for the other mirror on it (as a spare in that mirror), so I went ahead and tried mdadm --zero-superblock /dev/sdj2, but even after that, the add fails with the same error. I'm pretty sure I can make this work somehow (e.g., I haven't tried --force yet, or mdadm -r on the failed disk, or, worst case, since it's just swap, recreating the array).
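For concreteness, the fallbacks I have in mind would look roughly like the following (none of them run yet). Drop the failed member and retry the add:
# mdadm /dev/md124 -r /dev/sda2
# mdadm -a /dev/md124 /dev/sdj2
Or, worst case, since it's only swap, stop the array, recreate it from the two good partitions, and mkswap it again:
# mdadm --stop /dev/md124
# mdadm --create /dev/md124 --level=1 --raid-devices=2 /dev/sdc2 /dev/sdj2
# mkswap /dev/md124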
I've gone ahead and stopped using that array for now (it was used for swap). swapoff performed I/O to the array without error, so it doesn't seem failed.
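Another way to double-check, independent of mdadm's opinion, is the kernel's view in sysfs (given the mdadm -D output below, these should report clean and 1 respectively):
# cat /sys/block/md124/md/array_state
# cat /sys/block/md124/md/degraded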
Doesn't appear to be a too-small device:
# blockdev --getsize64 /dev/sda2 /dev/sdj2
10001940480
10001940480
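Working through the numbers as a sanity check against the mdadm -E output below: 10001940480 bytes is 19535040 512-byte sectors, and subtracting the 144-sector data offset leaves 19534896 sectors, exactly the Used Dev Size reported on sdc2. (The data offset mdadm picks for a newly added device can differ, but the partition is clearly not too small by a whole device's worth.)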
So I'm hoping someone else knows what this error means.
This is mdadm 3.3.2 (Debian Jessie) if it matters.
mdadm -D
# mdadm -D /dev/md124
/dev/md124:
Version : 1.2
Creation Time : Thu Mar 11 20:34:00 2010
Raid Level : raid1
Array Size : 9767448 (9.31 GiB 10.00 GB)
Used Dev Size : 9767448 (9.31 GiB 10.00 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Mon Oct 12 12:35:13 2015
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Name : Einstein:swap_a (local to host Einstein)
UUID : 3d7da9d2:5ea17db5:3b122196:11968e91
Events : 2044
Number Major Minor RaidDevice State
0 0 0 0 removed
2 8 34 1 active sync /dev/sdc2
0 8 2 - faulty /dev/sda2
mdadm -E
# mdadm -E /dev/sdc2
/dev/sdc2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 3d7da9d2:5ea17db5:3b122196:11968e91
Name : Einstein:swap_a (local to host Einstein)
Creation Time : Thu Mar 11 20:34:00 2010
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 19534897 (9.31 GiB 10.00 GB)
Array Size : 9767448 (9.31 GiB 10.00 GB)
Used Dev Size : 19534896 (9.31 GiB 10.00 GB)
Data Offset : 144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 95e09398:1c155ebd:323371cf:a3acc3ad
Internal Bitmap : 8 sectors from superblock
Update Time : Mon Oct 12 12:35:13 2015
Checksum : 132239e4 - correct
Events : 2044
Device Role : Active device 1
Array State : .A ('A' == active, '.' == missing, 'R' == replacing)
# mdadm -E /dev/sdj2
mdadm: No md superblock detected on /dev/sdj2.
1 Answer
Tracing through mdadm with gdb led me to a loop that attempted to scan through the array, looking for all the sync'd devices. Except it stopped early, before it found the working sdc2: the scan gives up after it has seen active_disks member devices, and on this degraded mirror active_disks is 1, so the faulty sda2 in slot 0 was enough to end the loop before it ever reached sdc2 in slot 2. The buggy line:
for (d = 0; d < MAX_DISKS && found < array->active_disks; d++) {
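In case it helps anyone reproduce this, a trace along these lines lands in the same place (this assumes an mdadm binary built locally with debug symbols; Manage_add is simply the function the diff below touches):
# gdb --args ./mdadm -a /dev/md124 /dev/sdj2
(gdb) break Manage_add
(gdb) run
(gdb) print array->active_disks
(gdb) print array->nr_disks
Given the mdadm -D output above, active_disks should print 1 while nr_disks should print 2, which is exactly the difference the fix turns out to hinge on.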
With that line in hand, it was fairly easy to find this was fixed in mdadm git:
commit d180d2aa2a1770af1ab8520d6362ba331400512f
Author: NeilBrown <[email protected]>
Date: Wed May 6 15:03:50 2015 +1000
Manage: fix test for 'is array failed'.
We 'active_disks' does not count spares, so if array is rebuilding,
this will not necessarily find all devices, so may report an array
as failed when it isn't.
Counting up to nr_disks is better.
Signed-off-by: NeilBrown <[email protected]>
diff --git a/Manage.c b/Manage.c
index d3cfb55..225af81 100644
--- a/Manage.c
+++ b/Manage.c
@@ -827,7 +827,7 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
int d;
int found = 0;
- for (d = 0; d < MAX_DISKS && found < array->active_disks; d++) {
+ for (d = 0; d < MAX_DISKS && found < array->nr_disks; d++) {
disc.number = d;
if (ioctl(fd, GET_DISK_INFO, &disc))
continue;
Applying that patch to mdadm fixes the problem. Oddly, though, after adding the disk, even though /proc/mdstat showed the spare present, it didn't start rebuilding until I stopped and re-assembled the array.
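For anyone else hitting this on Jessie, the rough sequence is below; the source directory and patch file names are approximations, and the patch is just the commit above saved locally:
# apt-get source mdadm
# cd mdadm-3.3.2
# patch -p1 < fix-manage-add.patch
# make
# ./mdadm -a /dev/md124 /dev/sdj2
# ./mdadm --stop /dev/md124
# ./mdadm --assemble /dev/md124 /dev/sdc2 /dev/sdj2
(The array was already swapoff'd here, so stopping and re-assembling it was safe; with an array that's still in use you'd want to stop using it first.)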