I have an Ubuntu 22.04 server that has its boot disk on an mdadm RAID1 array consisting of two 240 GB SSDs (/dev/sda and /dev/sdb). This mdadm array was set up using curtin during the initial install. In addition, only the boot, root and swap file systems are on this array; all other files are on a ZFS RAID10 array.
One of the disks (/dev/sda) has now failed completely and needs to be replaced. While the system continues to run on the other disk (/dev/sdb), it will only boot from the failed disk (/dev/sda). This is a problem, since I will need to boot the system from /dev/sdb after I have shut it down and replaced /dev/sda. Both /dev/sda and /dev/sdb have up-to-date EFI and /boot partitions.
I am currently planning the replacement and would appreciate any advice. So far, I think I will need to do the following:
- mark the partitions as failed using mdadm
- remove the failed partitions from the array using mdadm
- set /dev/sdb to be the boot disk
- shut down the system
- physically remove the failed disk and replace it with a new disk
- restart the system
- partition the new disk using sfdisk
- add the new partitions to the existing arrays using mdadm
- copy the files from the EFI partition to the new disk
- update grub
Most of the process looks pretty straightforward. It is steps 3 and 10, which deal with booting, that I am not sure about.
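For the mdadm steps (1, 2 and 8), this is roughly what I have in mind. It is only a dry-run sketch that prints the commands instead of running them; the array and partition pairs come from my /proc/mdstat output, and `plan_remove` and `plan_add` are just helper names I made up for this post.

```shell
#!/bin/sh
# Dry-run sketch: prints mdadm commands, does not execute them.
# "array:partition-number" pairs, taken from /proc/mdstat in this post.
PAIRS="md127:2 md126:3 md125:4"

# Print the commands that mark a disk's partitions failed and remove them.
plan_remove() {
    disk=$1
    for pair in $PAIRS; do
        md=${pair%%:*}
        part=${pair##*:}
        printf '%s\n' "mdadm --manage /dev/$md --fail $disk$part"
        printf '%s\n' "mdadm --manage /dev/$md --remove $disk$part"
    done
}

# Print the commands that add the replacement disk's partitions back.
plan_add() {
    disk=$1
    for pair in $PAIRS; do
        md=${pair%%:*}
        part=${pair##*:}
        printf '%s\n' "mdadm --manage /dev/$md --add $disk$part"
    done
}

plan_remove /dev/sda
plan_add /dev/sda
```

I am assuming the replacement disk shows up as /dev/sda again; if it gets a different name, the argument to `plan_add` would have to change.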
Below are the details of my setup:
fdisk -l /dev/sdb (both disks are partitioned the same)
Device        Start       End   Sectors   Size Type
/dev/sdb1      2048   2203647   2201600     1G EFI System
/dev/sdb2   2203648   4300799   2097152     1G Linux filesystem
/dev/sdb3   4300800  71409663  67108864    32G Linux filesystem
/dev/sdb4  71409664 468858879 397449216 189.5G Linux filesystem
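Since both disks share this layout, my plan for step 7 is to copy the partition table from /dev/sdb to the new disk with sfdisk. Again a dry-run sketch that only prints the commands; `plan_partition` and the dump file name are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints the partition-copy commands, does not execute them.
plan_partition() {
    src=$1                   # healthy disk to copy the layout from
    dst=$2                   # replacement disk
    dump=sdb-table.dump      # hypothetical file name for the saved layout
    printf '%s\n' "sfdisk --dump $src > $dump"
    printf '%s\n' "sfdisk $dst < $dump"
    printf '%s\n' "sfdisk --verify $dst"
    # Assumption to verify: the dump also carries the GPT disk and partition
    # GUIDs, so the copy may need fresh ones afterwards (for example with
    # sgdisk -G from the gdisk package).
}

plan_partition /dev/sdb /dev/sda
```

Getting the source and destination the wrong way round would overwrite the only good disk, which is why I would rather print and review the commands first.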
cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md125 : active raid1 sdb4[0] sda4[1](F)
198592512 blocks super 1.2 [2/1] [U_]
bitmap: 2/2 pages [8KB], 65536KB chunk
md126 : active raid1 sdb3[1] sda3[0](F)
33520640 blocks super 1.2 [2/1] [_U]
md127 : active raid1 sdb2[0] sda2[2](F)
1046528 blocks super 1.2 [2/1] [U_]
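To check which arrays are degraded before the swap and to confirm they are healthy after the resync, I could filter the mdstat text with something like the following. `degraded_arrays` is a name I made up; it treats an array as degraded when the second number in its `[n/m]` field is smaller than the first.

```shell
#!/bin/sh
# List degraded md arrays from mdstat-formatted text on stdin.
# An array is reported when its [total/active] field has active < total.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                 # remember the current array
        match($0, /\[[0-9]+\/[0-9]+\]/) {           # e.g. [2/1]
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
        }
    '
}

# Usage on the live system:
#   degraded_arrays < /proc/mdstat
```

With the output above, all three arrays (md125, md126 and md127) show `[2/1]`, so all three would be listed.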
cat /etc/fstab
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/md125p1 during curtin installation
/dev/disk/by-id/md-uuid-7f83998a:b81f586c:e3e6497a:9a9e36ce-part1 / ext4 defaults 0 1
/dev/disk/by-id/md-uuid-56619c5a:2fc620ba:3642eeae:73fd6319-part1 none swap sw 0 0
# /boot was on /dev/md127p1 during curtin installation
/dev/disk/by-id/md-uuid-78148d71:a0c26fd8:9ee89f4c:bfa69120-part1 /boot ext4 defaults 0 1
# /boot/efi was on /dev/sda1 during curtin installation
/dev/disk/by-uuid/D72E-12F9 /boot/efi vfat defaults 0 1
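Because /boot/efi is a plain vfat partition mounted by UUID and not part of any array, steps 3 and 9 have to be done by hand. The following is my untested guess, written as a dry run that only prints the commands. `plan_esp`, `plan_boot_entry`, the mount point and the boot entry label are hypothetical, and the shimx64.efi loader path is an assumption I still need to check under /boot/efi/EFI/ubuntu.

```shell
#!/bin/sh
# Dry-run sketch: prints commands for the EFI steps, does not execute them.

# Step 9: create a filesystem on the new ESP and copy the current one onto it.
plan_esp() {
    new_esp=$1            # ESP on the replacement disk, e.g. /dev/sda1
    mnt=/mnt/new-esp      # hypothetical temporary mount point
    printf '%s\n' "mkfs.vfat -F 32 $new_esp"
    printf '%s\n' "mkdir -p $mnt"
    printf '%s\n' "mount $new_esp $mnt"
    printf '%s\n' "cp -r /boot/efi/. $mnt/"
    printf '%s\n' "umount $mnt"
    # mkfs.vfat gives the copy a new filesystem UUID, so the fstab entry
    # above keeps pointing at the existing ESP.
}

# Step 3: register a firmware boot entry that points at a given disk's ESP.
plan_boot_entry() {
    disk=$1
    part=$2
    label=$3
    # Loader path is an assumption (usual Ubuntu shim location).
    printf '%s\n' "efibootmgr --create --disk $disk --part $part --label '$label' --loader '\\EFI\\ubuntu\\shimx64.efi'"
}

plan_boot_entry /dev/sdb 1 ubuntu-sdb
plan_esp /dev/sda1
```

Running `efibootmgr -v` first should show which disk the existing entries point at and the current boot order.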
lsblk
├─sda1 8:1 0 1G 0 part
├─sda2 8:2 0 1G 0 part
│ └─md127 9:127 0 1022M 0 raid1
│ └─md127p1 259:1 0 1020M 0 part /boot
├─sda3 8:3 0 32G 0 part
│ └─md126 9:126 0 32G 0 raid1
│ └─md126p1 259:0 0 32G 0 part [SWAP]
└─sda4 8:4 0 189.5G 0 part
└─md125 9:125 0 189.4G 0 raid1
└─md125p1 259:2 0 189.4G 0 part /
sdb 8:16 0 223.6G 0 disk
├─sdb1 8:17 0 1G 0 part /boot/efi
├─sdb2 8:18 0 1G 0 part
│ └─md127 9:127 0 1022M 0 raid1
│ └─md127p1 259:1 0 1020M 0 part /boot
├─sdb3 8:19 0 32G 0 part
│ └─md126 9:126 0 32G 0 raid1
│ └─md126p1 259:0 0 32G 0 part [SWAP]
└─sdb4 8:20 0 189.5G 0 part
└─md125 9:125 0 189.4G 0 raid1
└─md125p1 259:2 0 189.4G 0 part /