
I am currently installing SuSE Linux 15. I want to protect this system against disk failure, so I opted to configure software RAID 1 across disk 1 (sda) and disk 2 (sdb). During installation I set up the disk partitions as RAID 1 members and created the following arrays:

md0 (sda2+sdb2): /boot/efi
md1 (sda3+sdb3): /boot
md2 (sda4+sdb4): swap
md3 (sda5+sdb5): /

I then tested the setup by removing disk sdb, and the system booted without problems. But when I removed the primary disk sda instead, the system could not get past GRUB. After analysing this, I realised that GRUB is configured to point to the software RAID partition md1, but it apparently needs both disks present, i.e. a fully working array, in order to boot from that device. I am not okay with this if it is really true. The whole point of putting /boot on software RAID 1 is that, in case of a hardware failure, the system should boot from the other, working disk. Isn't that what RAID 1 is meant for? I am really confused now. If you require any more system details, please ask. How do I protect the system (and its boot process) against hardware failure? Disks do eventually fail, so there must be a solution; I just do not know it yet.

Regards.

Here is the output of the commands you asked for:
localhost:~ # lsblk -f | grep -v loop | sed -E 's/\w{8}-\w{4}-\w{4}-\w{4}-\w{12}/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/'
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1
├─sda2 linux_raid_member 1.0 any:0 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
├─sda3 linux_raid_member 1.0 any:1 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
├─sda4 linux_raid_member 1.0 any:2 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
│ └─md2 swap 1 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX [SWAP]
└─sda5 linux_raid_member 1.0 any:3 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
sdb
├─sdb1
├─sdb2 linux_raid_member 1.0 any:0 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
│ └─md0 vfat FAT32 ADB7-4EE5 1021.9M 0% /boot/efi
├─sdb3 linux_raid_member 1.0 any:1 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
│ └─md1 ext4 1.0 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX 853.8M 7% /boot
├─sdb4 linux_raid_member 1.0 any:2 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
│ └─md2 swap 1 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX [SWAP]
└─sdb5 linux_raid_member 1.0 any:3 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
 └─md3 ext4 1.0 XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX 4.4G 54% /
sr0 iso9660 Joliet Extension SLE-15-SP6-Full-x86_649351001 2024-06-13-19-56-33-00
localhost:~ # fdisk -l
Disk /dev/sda: 15 GiB, 16106127360 bytes, 31457280 sectors
Disk model: VBOX HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 0E77630A-CBC9-4C1F-9321-D7886DAC09BC
 
Device Start End Sectors Size Type
/dev/sda1 2048 104447 102400 50M BIOS boot
/dev/sda2 104448 2201599 2097152 1G Linux RAID
/dev/sda3 2201600 4298751 2097152 1G Linux RAID
/dev/sda4 4298752 8493055 4194304 2G Linux RAID
/dev/sda5 8493056 31457246 22964191 11G Linux RAID
 
 
Disk /dev/sdb: 15 GiB, 16106127360 bytes, 31457280 sectors
Disk model: VBOX HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 07F46E03-A25D-4305-BCFD-E79ADEA113C1
 
Device Start End Sectors Size Type
/dev/sdb1 2048 104447 102400 50M BIOS boot
/dev/sdb2 104448 2201599 2097152 1G Linux RAID
/dev/sdb3 2201600 4298751 2097152 1G Linux RAID
/dev/sdb4 4298752 8493055 4194304 2G Linux RAID
/dev/sdb5 8493056 31457246 22964191 11G Linux RAID
 
 
Disk /dev/md1: 1023.94 MiB, 1073676288 bytes, 2097024 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
 
 
Disk /dev/md3: 10.95 GiB, 11757551616 bytes, 22963968 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
 
 
Disk /dev/md0: 1023.94 MiB, 1073676288 bytes, 2097024 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000
 
 
Disk /dev/md2: 2 GiB, 2147418112 bytes, 4194176 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Respective Partitions:

/dev/sda1 --> BIOS BOOT Partition
/dev/sdb1 --> BIOS BOOT Partition
/dev/md0 (sda2+sdb2) : /boot/efi
/dev/md1 (sda3+sdb3): /boot 
/dev/md2 (sda4+sdb4): swap 
/dev/md3 (sda5+sdb5): / 
asked Jan 13 at 19:29

2 Answers


I do not have enough reputation points to comment yet, so I have to put some of this in an answer. What you want to do is possible, and I have done something similar with RAID 10, though not with EFI. At a minimum, I can suggest a disk layout that may make your task easier.

RAID 1 should indeed protect you from the loss of a disk without crippling the machine. However, since grub does not "understand" software RAID, there are some manual steps. To boot from either disk, a few things need to be true:

  • the motherboard must recognize either/both as a boot device (it sounds like you are past this)
  • each disk needs grub installed (and manually reinstalled after a drive swap)

If you set up RAID at install time, each disk must be marked as a boot device (usually not the default). This leaves space for grub on each drive. The step 1 screenshots in this guide are very informative.
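
For reference, when you need to (re)install grub on a disk, e.g. after a drive swap, on SUSE with legacy/BIOS booting it usually looks something like the following. This is only a sketch; I have not done this on SLE 15 myself, and an EFI setup needs a different procedure:

$ sudo grub2-install /dev/sda                  # write GRUB to the BIOS boot partition on the first disk
$ sudo grub2-install /dev/sdb                  # repeat for the second mirror member
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg  # regenerate the config if anything changed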

Check your system to see whether the partitions are identical on both disks. If you can post the output of the commands below (with whatever additional redactions are appropriate), that would help show how your disks are currently set up.

$ lsblk -f | grep -v loop | sed -E 's/\w{8}-\w{4}-\w{4}-\w{4}-\w{12}/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/'
$ fdisk -l

If the problem disk was not configured to boot at install time, you can fix that after the fact. Hopefully this helps.
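
One additional quick check, assuming the machine boots via UEFI (it has an EFI system partition), is to see which disks the firmware actually has boot entries for:

$ sudo efibootmgr -v   # lists UEFI boot entries and which disk/partition each one points to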

P.S.: In terms of layout, unless requirements dictate otherwise, you might benefit from putting the full capacity of your disks into a single md0 RAID1. You can partition the md0 further if desired (e.g., / and /home). Unless you want swap on dedicated hardware or at a different RAID level, a swap file (e.g., /swap.img) is easy to set up and much easier to resize in the future than a partition if your needs change.
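
If you go that route, creating a swap file is only a few commands. A sketch; the size and path here are just examples:

$ sudo dd if=/dev/zero of=/swap.img bs=1M count=2048          # 2 GiB swap file (fallocate also works on ext4)
$ sudo chmod 600 /swap.img
$ sudo mkswap /swap.img
$ sudo swapon /swap.img
$ echo '/swap.img none swap sw 0 0' | sudo tee -a /etc/fstab  # make it persistent across reboots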


UPDATE: Thanks for adding that information. Clearly the partition layouts of the two disks match. But it is strange to me that lsblk -f does not show your sda2, sda3, and sda5 as part of md0, md1, and md3, respectively. There may be something wrong with the setup, which could explain the boot failure (i.e., if the md0 with /boot/efi is missing from sda).

I wonder whether your system is confused about which partitions belong to which arrays. To help diagnose, could you also add the output of the following:

$ cat /proc/mdstat
$ mdadm --detail /dev/md0
$ mdadm --examine /dev/sda2
$ lsblk | grep -v loop

Hopefully those provide additional clues. For reference, I expect to see your md0 listed under both sda2 and sdb2 when things are working. The text below shows lsblk output for a working, bootable 4-disk RAID10 on Ubuntu 22. Note that md0 appears under each of the sdX2 partitions:

$ lsblk | grep -v loop
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 5.5T 0 disk 
├─sda1 8:1 0 1M 0 part 
└─sda2 8:2 0 5.5T 0 part 
 └─md0 9:0 0 10.9T 0 raid10 
 ├─md0p1 259:0 0 500G 0 part /
 └─md0p2 259:1 0 10.4T 0 part /data
sdb 8:16 0 5.5T 0 disk 
├─sdb1 8:17 0 1M 0 part 
└─sdb2 8:18 0 5.5T 0 part 
 └─md0 9:0 0 10.9T 0 raid10 
 ├─md0p1 259:0 0 500G 0 part /
 └─md0p2 259:1 0 10.4T 0 part /data
sdc 8:32 0 5.5T 0 disk 
├─sdc1 8:33 0 1M 0 part 
└─sdc2 8:34 0 5.5T 0 part 
 └─md0 9:0 0 10.9T 0 raid10 
 ├─md0p1 259:0 0 500G 0 part /
 └─md0p2 259:1 0 10.4T 0 part /data
sdd 8:48 0 5.5T 0 disk 
├─sdd1 8:49 0 1M 0 part 
└─sdd2 8:50 0 5.5T 0 part 
 └─md0 9:0 0 10.9T 0 raid10 
 ├─md0p1 259:0 0 500G 0 part /
 └─md0p2 259:1 0 10.4T 0 part /data
answered Jan 14 at 6:11
  • Thanks sivs422 for your thoughts. I have updated my original post with the requested output. Commented Jan 14 at 10:26
  • Thanks for your reply, but I did not imagine this getting so complicated. My intention is that in case of a disk failure the system should still boot, so there is no downtime. I guess that is a pretty normal ask, or? For this I thought I would add one more disk and mirror it, so in case of a disk failure the system boots from the secondary disk and that is it. But I guess it is not that simple. I really wonder how people out there lay out their partitions to protect against disk failure, then? Commented Jan 22 at 8:16
  • Can anyone just let me know what partitions I should create for my requirement and I shall do it. Commented Jan 22 at 8:17

I am adding a separate answer to describe how I have solved this same issue on a related OS (Ubuntu 22.04). The basic steps are all outlined in this guide.

The key steps are to (1) FIRST ensure that every disk is marked as bootable and gets a grub spacer and then (2) create a software RAID 1 array with remaining space on all disks.

In more detail, the steps you want to follow are:

  0. I would recommend enabling legacy BIOS booting on the system. I am not sure this works with EFI booting.

  1. Start with empty disks (partitions deleted). Some installers need this.
  2. Select a custom storage layout when installing your OS.
  3. IMPORTANT: select a disk as primary boot device.
  4. VERY IMPORTANT: select the other disk and ensure "Add as another Boot Device" (or something similar) is checked. Once you do this, your layout will show matching grub bios partitions on all disks. See screenshots in the linked guide.
  5. Create a new GPT partition on each member disk using all of the free space. DO NOT format it. There should be a "Leave unformatted" option. IMPORTANT: make sure the capacities of these partitions match. If the disk capacities do not match, choose a smaller size that fits on all/both of them.
  6. Select the option to create software RAID. Choose RAID level 1 and make sure all/both member disks are included. Once this is done, your new array (probably called "md0") should appear in the list of available devices.
  7. Select the new RAID device and add one or more GPT partitions. The simplest thing to do is create one giant partition (mounted at /). If you wanted to set things up differently, this is where you do it. Any format should work, but I use ext4 in my systems. You do NOT need a separate partition for swap (I recommend creating a swap file in your root partition).
  8. After partitions are laid out as you wish, continue the installation as normal.

At the end of this process, you should have a system with redundancy. After the first boot, check the device state with cat /proc/mdstat to make sure the mirror is complete; it may still need time to sync, depending on what the installer did. You should also have grub installed on both devices at this point. The grub configuration should point to a same-disk location, meaning either disk alone is enough to boot the system. The software RAID in this example will consist of the second partition on each disk (/dev/sda2 + /dev/sdb2 + ...).
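
A quick way to confirm the mirror is healthy (a sketch; the array name md0 is assumed):

$ cat /proc/mdstat              # a healthy two-disk mirror shows [UU] next to the array
$ sudo mdadm --detail /dev/md0  # look for "State : clean" (or "active") and two "active sync" members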

VERY IMPORTANT: Testing is a great idea. You want to know that this provides the redundancy you need and that the recovery procedure works. HOWEVER, be careful how you do it. Each drive knows the GUIDs of all RAID members. To test whether you can boot with disk A, make sure you manually eject disk B from the software RAID before you shut down or pull the drive. ALSO be sure to wipe the partition table on B before reinstalling! If you just shut down and remove one disk, you may have a problem. The disks' contents will diverge while the system runs with one missing (logs, timestamps, etc), but the ejected disk will not "know" this and may cause a problem if it is selected as the boot device when you put it back.
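
As a concrete example of what I mean by manually ejecting a member before pulling it (a sketch, assuming the array is md0 and /dev/sdb2 is the member being removed):

$ sudo mdadm --manage /dev/md0 --fail /dev/sdb2     # mark the member as failed
$ sudo mdadm --manage /dev/md0 --remove /dev/sdb2   # remove it from the array
$ sudo wipefs -a /dev/sdb2                          # later, before reinstalling the disk: clear its old RAID metadata
$ sudo sgdisk --zap-all /dev/sdb                    # and/or wipe its whole partition table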

VERY IMPORTANT: This setup does not preserve itself automatically if a drive fails and is replaced. You must manually restore the boot partition. When a drive fails (in what follows, assume sda is OK and sdb is a fresh disk you added), you need to:

  1. Manually copy the partition table from the remaining good disk to the newly installed disk. Something like sgdisk /dev/sda -R /dev/sdb should work. (You might need to run partprobe if the OS does not do it automatically.)
  2. Randomize the GUID on the new disk. Try sgdisk -G /dev/sdb for this.
  3. Clone the boot partition to the new disk with dd if=/dev/sda1 of=/dev/sdb1.
  4. Add the 2nd partition on the new disk (created when you copied the partition table) to the software RAID, and the rebuild should start. Something like mdadm --manage /dev/md0 --add /dev/sdb2 should be what you need. A consolidated sketch of all four steps follows below.
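
Putting those four steps together (a sketch, assuming sda is the surviving disk and sdb is the blank replacement; double-check the device names before running anything destructive):

$ sudo sgdisk /dev/sda -R /dev/sdb               # copy the partition table from sda onto sdb
$ sudo sgdisk -G /dev/sdb                        # randomize GUIDs on the new disk
$ sudo partprobe /dev/sdb                        # re-read the partition table if the OS has not already
$ sudo dd if=/dev/sda1 of=/dev/sdb1              # clone the boot / grub spacer partition
$ sudo mdadm --manage /dev/md0 --add /dev/sdb2   # re-add the mirror member; the rebuild starts
$ cat /proc/mdstat                               # monitor rebuild progress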

I have used the above procedure without issue for a long time but note that I generally do not shut down the system when a disk is missing (the systems in question have hot-swap bays and rarely shut down). I hope this helps.

answered Feb 5 at 3:18
