I have an Ubuntu 22.04 server that has its boot disk on an mdadm RAID1 array consisting of two 240 GB SSDs (/dev/sda & /dev/sdb). This mdadm array was set up using curtin during the initial install. In addition, only the boot, root & swap file systems are on this array; all other files are on a ZFS RAID10 array.

One of the disks (/dev/sda) has now failed completely and needs to be replaced. While the system continues to run on the other disk (/dev/sdb), it will only boot from the failed disk (/dev/sda). This is a problem because, after I shut the system down and replace /dev/sda, I will need to boot it from /dev/sdb. Both /dev/sda & /dev/sdb have up-to-date EFI & /boot partitions.

I am currently planning the replacement and would appreciate any advice. So far, I think I will need to do the following:

  1. mark the partitions as failed using mdadm
  2. remove the failed partitions from the array using mdadm
  3. set /dev/sdb to be the boot disk
  4. shut down the system
  5. physically remove the failed disk and replace it with a new disk
  6. restart the system
  7. partition the new disk using sfdisk
  8. add the new partitions to the existing arrays using mdadm
  9. copy the files from the EFI partition to the new disk
  10. update grub
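
For reference, the mdadm and sfdisk steps (1, 2, 7 and 8) could be sketched roughly as follows. This is only a dry run that prints the commands rather than executing them. It assumes the replacement disk comes up as /dev/sda again (it may not) and that sgdisk from the gdisk package is available. The array-to-partition mapping is taken from the /proc/mdstat output below, and the names GOOD, NEW, MEMBERS and plan are just placeholders for this sketch.

```shell
#!/bin/sh
# Dry run: prints the commands for steps 1, 2, 7 and 8 without executing them.
# Assumptions (not confirmed): the replacement disk appears as /dev/sda again,
# and sgdisk (gdisk package) is installed.
GOOD=/dev/sdb   # surviving disk
NEW=/dev/sda    # failed disk, later its replacement

# array:partition-number pairs, as listed in /proc/mdstat
MEMBERS="md127:2 md126:3 md125:4"

plan() {
    # Steps 1-2: mark each sda member as failed, then remove it from its array.
    # mdstat already shows these members as (F), so --fail may be a no-op.
    for pair in $MEMBERS; do
        md=${pair%%:*}
        part=${pair##*:}
        printf '%s\n' "mdadm /dev/$md --fail $NEW$part"
        printf '%s\n' "mdadm /dev/$md --remove $NEW$part"
    done

    # Step 7: copy the partition table from the good disk to the new one.
    # The dump also copies the disk and partition GUIDs, so randomize them after.
    printf '%s\n' "sfdisk -d $GOOD | sfdisk $NEW"
    printf '%s\n' "sgdisk -G $NEW"

    # Step 8: add the new partitions back into the arrays (this starts the resync).
    for pair in $MEMBERS; do
        md=${pair%%:*}
        part=${pair##*:}
        printf '%s\n' "mdadm /dev/$md --add $NEW$part"
    done
}

plan
```

Printing the plan first makes it easy to check the device names against lsblk after the reboot before running anything as root.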

Most of the process looks pretty straightforward. It is steps 3 and 10, which deal with booting, that I am not sure about.
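
For the boot-related steps (3, 9 and 10), one possible approach on a UEFI system is sketched below, again as a dry run that only prints commands. Several things here are assumptions rather than facts about this setup: the loader path \EFI\ubuntu\shimx64.efi is the usual Ubuntu location but has not been verified on this machine, the label "ubuntu-sdb" is made up, and the replacement disk is assumed to appear as /dev/sda. I believe `dpkg-reconfigure grub-efi-amd64` on 22.04 offers to keep several EFI system partitions in sync, but I have not confirmed that here.

```shell
#!/bin/sh
# Dry run: prints candidate commands for steps 3, 9 and 10 without executing them.
GOOD=/dev/sdb                       # surviving disk, ESP is partition 1
NEW=/dev/sda                        # assumed name of the replacement disk
LOADER='\EFI\ubuntu\shimx64.efi'    # assumed loader path on the ESP
LABEL=ubuntu-sdb                    # hypothetical boot entry label

boot_plan() {
    # Step 3: list the current UEFI boot entries, then add one pointing at sdb1.
    printf '%s\n' "efibootmgr -v"
    printf '%s\n' "efibootmgr --create --disk $GOOD --part 1 --label $LABEL --loader '$LOADER'"

    # Step 9: clone the ESP onto the new disk. dd also copies the filesystem
    # UUID, so both ESPs would then match the by-uuid entry in /etc/fstab.
    printf '%s\n' "dd if=${GOOD}1 of=${NEW}1 bs=1M"

    # Step 10: reinstall GRUB to the mounted ESP and regenerate grub.cfg.
    printf '%s\n' "grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu"
    printf '%s\n' "update-grub"
    # Possibly: let the package manage both ESPs (unverified on this system).
    printf '%s\n' "dpkg-reconfigure grub-efi-amd64"
}

boot_plan
```

Running `efibootmgr -v` before shutting down should show whether an entry for the ESP on /dev/sdb already exists, in which case only the boot order would need changing.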

Below are the details of my setup:

fdisk /dev/sdb (both disks are partitioned the same)

Device        Start       End   Sectors   Size Type
/dev/sdb1      2048   2203647   2201600     1G EFI System
/dev/sdb2   2203648   4300799   2097152     1G Linux filesystem
/dev/sdb3   4300800  71409663  67108864    32G Linux filesystem
/dev/sdb4  71409664 468858879 397449216 189.5G Linux filesystem

cat /proc/mdstat

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md125 : active raid1 sdb4[0] sda4[1](F)
 198592512 blocks super 1.2 [2/1] [U_]
 bitmap: 2/2 pages [8KB], 65536KB chunk
md126 : active raid1 sdb3[1] sda3[0](F)
 33520640 blocks super 1.2 [2/1] [_U]
 
md127 : active raid1 sdb2[0] sda2[2](F)
 1046528 blocks super 1.2 [2/1] [U_]

cat /etc/fstab

# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/md125p1 during curtin installation
/dev/disk/by-id/md-uuid-7f83998a:b81f586c:e3e6497a:9a9e36ce-part1 / ext4 defaults 0 1
/dev/disk/by-id/md-uuid-56619c5a:2fc620ba:3642eeae:73fd6319-part1 none swap sw 0 0
# /boot was on /dev/md127p1 during curtin installation
/dev/disk/by-id/md-uuid-78148d71:a0c26fd8:9ee89f4c:bfa69120-part1 /boot ext4 defaults 0 1
# /boot/efi was on /dev/sda1 during curtin installation
/dev/disk/by-uuid/D72E-12F9 /boot/efi vfat defaults 0 1

lsblk

├─sda1 8:1 0 1G 0 part 
├─sda2 8:2 0 1G 0 part 
│ └─md127 9:127 0 1022M 0 raid1 
│ └─md127p1 259:1 0 1020M 0 part /boot
├─sda3 8:3 0 32G 0 part 
│ └─md126 9:126 0 32G 0 raid1 
│ └─md126p1 259:0 0 32G 0 part [SWAP]
└─sda4 8:4 0 189.5G 0 part 
 └─md125 9:125 0 189.4G 0 raid1 
 └─md125p1 259:2 0 189.4G 0 part /
sdb 8:16 0 223.6G 0 disk 
├─sdb1 8:17 0 1G 0 part /boot/efi
├─sdb2 8:18 0 1G 0 part 
│ └─md127 9:127 0 1022M 0 raid1 
│ └─md127p1 259:1 0 1020M 0 part /boot
├─sdb3 8:19 0 32G 0 part 
│ └─md126 9:126 0 32G 0 raid1 
│ └─md126p1 259:0 0 32G 0 part [SWAP]
└─sdb4 8:20 0 189.5G 0 part 
 └─md125 9:125 0 189.4G 0 raid1 
 └─md125p1 259:2 0 189.4G 0 part /
asked Sep 16, 2024 at 22:39
