
TLDR: Trying to add a blank partition to a degraded RAID1 with mdadm 3.3.2 (Debian Jessie) fails, saying the (perfectly working!) array has "failed" and "--add cannot work". Is it telling me of a real danger here, or have I just hit some weird bug?

Detailed Version

Overnight, I had a disk die. There are 5 mdraid arrays on the box; one of them (a RAID10) rebuilt as expected using a spare. A RAID6 remains degraded until a replacement disk arrives tonight, and the same goes for a 5-disk mirror for /boot. There are also two RAID1 arrays used for swap, which share a hot spare. The hot spare was attached to the mirror which didn't experience a failure, but both mirrors are in the same spare group, so mdadm --monitor attempted to move the spare over. That move failed; it didn't give an error so far as I can tell, it just lost the spare.
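For context, the spare sharing is configured with spare-group entries in mdadm.conf, roughly like this (the second ARRAY line's device name and UUID are placeholders here, not my actual config):

ARRAY /dev/md124 metadata=1.2 UUID=3d7da9d2:5ea17db5:3b122196:11968e91 spare-group=swap
ARRAY /dev/md125 metadata=1.2 UUID=<uuid-of-other-swap-mirror> spare-group=swap

With mdadm --monitor running, a failure in one array is supposed to pull the spare across from the other array in the same spare-group; that's the move that silently lost the spare here.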

This morning, the degraded mirror looks like:

md124 : active raid1 sda2[0](F) sdc2[2]
 9767448 blocks super 1.2 [2/1] [_U]
 bitmap: 0/150 pages [0KB], 32KB chunk

I tried to manually add the spare, and got:

# mdadm -a /dev/md124 /dev/sdj2 
mdadm: /dev/md124 has failed so using --add cannot work and might destroy
mdadm: data on /dev/sdj2. You should stop the array and re-assemble it.

/dev/sdj2 still had the superblock from the other mirror on it (as a spare in that mirror), so I went ahead and ran mdadm --zero-superblock /dev/sdj2, but even after that, the add fails with the same error. I'm pretty sure I can make this work eventually (e.g., I haven't tried --force yet, or mdadm -r on the failed disk; worst case, it's just swap, so I could recreate the array).
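Spelled out, the fallback plan is roughly this (a sketch of what I'd try next, not commands I've run yet; device names as above, and the --create step would of course wipe the array, which is acceptable since it's only swap):

# mdadm /dev/md124 -r /dev/sda2
# mdadm /dev/md124 -a /dev/sdj2

and, as a last resort:

# mdadm --stop /dev/md124
# mdadm --create /dev/md124 --level=1 --raid-devices=2 /dev/sdc2 /dev/sdj2
# mkswap /dev/md124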

I've gone ahead and stopped using that array for now (it was only used for swap). swapoff performed I/O to the array without error, so it doesn't seem to have actually failed.
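For completeness, taking it out of service was nothing more exotic than swapoff, with /proc/swaps to confirm it's no longer in use:

# swapoff /dev/md124
# cat /proc/swaps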

Doesn't appear to be a too-small device:

# blockdev --getsize64 /dev/sda2 /dev/sdj2 
10001940480
10001940480

So I'm hoping someone else knows what this error means.

This is mdadm 3.3.2 (Debian Jessie) if it matters.

mdadm -D

# mdadm -D /dev/md124 
/dev/md124:
 Version : 1.2
 Creation Time : Thu Mar 11 20:34:00 2010
 Raid Level : raid1
 Array Size : 9767448 (9.31 GiB 10.00 GB)
 Used Dev Size : 9767448 (9.31 GiB 10.00 GB)
 Raid Devices : 2
 Total Devices : 2
 Persistence : Superblock is persistent
 Intent Bitmap : Internal
 Update Time : Mon Oct 12 12:35:13 2015
 State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
 Spare Devices : 0
 Name : Einstein:swap_a (local to host Einstein)
 UUID : 3d7da9d2:5ea17db5:3b122196:11968e91
 Events : 2044
 Number Major Minor RaidDevice State
 0 0 0 0 removed
 2 8 34 1 active sync /dev/sdc2
 0 8 2 - faulty /dev/sda2

mdadm -E

# mdadm -E /dev/sdc2
/dev/sdc2:
 Magic : a92b4efc
 Version : 1.2
 Feature Map : 0x1
 Array UUID : 3d7da9d2:5ea17db5:3b122196:11968e91
 Name : Einstein:swap_a (local to host Einstein)
 Creation Time : Thu Mar 11 20:34:00 2010
 Raid Level : raid1
 Raid Devices : 2
 Avail Dev Size : 19534897 (9.31 GiB 10.00 GB)
 Array Size : 9767448 (9.31 GiB 10.00 GB)
 Used Dev Size : 19534896 (9.31 GiB 10.00 GB)
 Data Offset : 144 sectors
 Super Offset : 8 sectors
 State : clean
 Device UUID : 95e09398:1c155ebd:323371cf:a3acc3ad
Internal Bitmap : 8 sectors from superblock
 Update Time : Mon Oct 12 12:35:13 2015
 Checksum : 132239e4 - correct
 Events : 2044
 Device Role : Active device 1
 Array State : .A ('A' == active, '.' == missing, 'R' == replacing)
# mdadm -E /dev/sdj2 
mdadm: No md superblock detected on /dev/sdj2.
asked Oct 12, 2015 at 16:49

1 Answer


Tracing through mdadm with gdb led me to a loop that attempted to scan through the array, looking for all the sync'd devices. Except it stopped early, before it found the working sdc2. With the buggy line of code in hand:

for (d = 0; d < MAX_DISKS && found < array->active_disks; d++) {

it was fairly easy to find that this had already been fixed in mdadm git:

commit d180d2aa2a1770af1ab8520d6362ba331400512f
Author: NeilBrown <[email protected]>
Date: Wed May 6 15:03:50 2015 +1000
 Manage: fix test for 'is array failed'.
 We 'active_disks' does not count spares, so if array is rebuilding,
 this will not necessarily find all devices, so may report an array
 as failed when it isn't.
 Counting up to nr_disks is better.
 Signed-off-by: NeilBrown <[email protected]>
diff --git a/Manage.c b/Manage.c
index d3cfb55..225af81 100644
--- a/Manage.c
+++ b/Manage.c
@@ -827,7 +827,7 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 int d;
 int found = 0;
- for (d = 0; d < MAX_DISKS && found < array->active_disks; d++) {
+ for (d = 0; d < MAX_DISKS && found < array->nr_disks; d++) {
 disc.number = d;
 if (ioctl(fd, GET_DISK_INFO, &disc))
 continue;

Applying that patch to mdadm fixes the problem. Oddly, though, even though /proc/mdstat showed the spare present after adding the disk, the array didn't start rebuilding until I stopped and re-assembled it.
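For anyone hitting the same thing, the rough sequence was along these lines (a sketch, not exact commands; the source-package name, unpack directory, and build steps are the standard Debian ones and may differ on your system):

# apt-get source mdadm
# cd mdadm-3.3.2
(edit Manage.c so the loop in Manage_add() counts up to array->nr_disks instead of array->active_disks, as in the commit above)
# make
# ./mdadm /dev/md124 -a /dev/sdj2

and then, since the rebuild didn't start on its own:

# mdadm --stop /dev/md124
# mdadm --assemble /dev/md124 /dev/sdc2 /dev/sdj2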

answered Oct 12, 2015 at 17:40
