Emergency IT procedures

Introduction

This page describes procedures that are best kept as a hard copy.

Procedure: replacing a broken disk in macaroni

This procedure is obsolete, but I am keeping it because it will be useful again if I switch back to NFS for centralised storage.

  1. Determine which disk has failed (hopefully smartd or Nagios has already told you precisely which one).
  2. Determine the partition table and partition table type of the failed disk (hopefully, you recorded this information when you installed the system). This information should look something like this:
    macaroni# parted /dev/sda print
    ...
    Partition Table: gpt
    
    Number  Start   End     Size    File system  Name  Flags
     1      1049kB  2097kB  1049kB                     bios_grub
     2      2097kB  4001GB  4001GB                     lvm
    
    macaroni#
  3. If partitions are given to LVM, then examine which VGs and LVs they belong to. This information should look something like this:
    macaroni# pvs /dev/sda2
      PV         VG   Fmt  Attr PSize   PFree
      /dev/sda2  vg0  lvm2 a--  931.51g    0
    macaroni# lvs vg0
      LV    VG   Attr     LSize   Pool Origin Data%  Move Log Copy%  Convert
      data0 vg0  -wi-ao-- 913.35g                                           
      data1 vg0  -wi-ao-- 913.35g                                           
      data2 vg0  -wi-ao-- 913.35g                                           
      data3 vg0  -wi-ao-- 913.35g                                           
      root  vg0  mwi-aom-  14.43g                             100.00        
      swap  vg0  mwi-aom-   3.72g                             100.00        
    macaroni#
  4. If partitions or LVs are given to MD, then examine which MDs they belong to (the commands below actually work in the opposite direction, from each MD to its members, but the data is easily correlated). This information should look something like this:
    macaroni# mdadm -Q --detail /dev/md0
    ...
         Raid Level : raid5
    ...
        Number   Major   Minor   RaidDevice State
           0     253        6        0      active sync   /dev/dm-6
           1     253        7        1      active sync   /dev/dm-7
           2     253        8        2      active sync   /dev/dm-8
           4     253        9        3      active sync   /dev/dm-9
    macaroni# ls -ld /dev/vg0/data0
    lrwxrwxrwx 1 root root 7 Nov 30 12:31 /dev/vg0/data0 -> ../dm-6
    macaroni#
  5. Having determined this information, what remains to be done, in this specific example, is:
    1. tell mdadm that /dev/dm-6 (/dev/vg0/data0), which is part of the RAID5 md0, has failed and remove it from the RAID (thereby making the array non-redundant)
    2. remove that LV (/dev/vg0/data0) from its parent VG (/dev/vg0)
    3. remove the mirror halves of root and swap (both in /dev/vg0) that are on the failed disk (/dev/sda2)
  6. Firstly, we mark one component of the MD RAID5 as failed:
    macaroni# mdadm /dev/md0 --fail /dev/dm-6 --remove /dev/dm-6
    mdadm: set /dev/dm-6 faulty in /dev/md0
    mdadm: hot remove failed for /dev/dm-6: Device or resource busy
    macaroni# mdadm /dev/md0 --remove /dev/dm-6
    mdadm: hot removed /dev/dm-6 from /dev/md0
    macaroni#

    (Note that the combined fail+remove command failed, although its fail part succeeded; a later remove, run on its own after a short delay, succeeded.)

    The LV is no longer part of the RAID5 array:

    macaroni# cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md0 : active raid5 dm-9[4] dm-8[2] dm-7[1]
          2872760832 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
    
    unused devices: <none>
    macaroni#

    but it is still part of the VG, although it is no longer being held open:

    macaroni# lvs
      LV    VG   Attr     LSize   Pool Origin Data%  Move Log Copy%  Convert
      data0 vg0  -wi-a--- 913.35g                                           
      data1 vg0  -wi-ao-- 913.35g                                           
      data2 vg0  -wi-ao-- 913.35g                                           
      data3 vg0  -wi-ao-- 913.35g                                           
      root  vg0  mwi-aom-  14.43g                             100.00        
      swap  vg0  mwi-aom-   3.72g                             100.00        
    macaroni#
  7. Next we remove that LV from the parent VG:
    macaroni# lvremove /dev/vg0/data0
     Do you really want to remove active logical volume data0? [y/n]: y
       Logical volume "data0" successfully removed
  8. This leaves the mirrored LVs, from which we need to remove the halves on the broken disk:
    macaroni# lvconvert --mirrors 0 /dev/vg0/root /dev/sda2
      Logical volume root converted.
    macaroni# lvconvert --mirrors 0 /dev/vg0/swap /dev/sda2
      Logical volume swap converted.
    macaroni#

    As a result of that, those LVs no longer show up as mirrored:

    macaroni# lvs -a -o +devices
      LV    VG   Attr     LSize   Pool Origin Data%  Move Log Copy%  Convert Devices        
      data1 vg0  -wi-ao-- 913.35g                                            /dev/sdb2(4648)
      data2 vg0  -wi-ao-- 913.35g                                            /dev/sdc2(0)   
      data3 vg0  -wi-ao-- 913.35g                                            /dev/sdd2(0)   
      root  vg0  -wi-ao--  14.43g                                            /dev/sdb2(953)
      swap  vg0  -wi-ao--   3.72g                                            /dev/sdb2(0)   
    macaroni#
  9. Because the failed device no longer has any LVs on it, it can be removed from the VG and the PV then removed from the LVM configuration:
    macaroni# vgreduce /dev/vg0 /dev/sda2
      Removed "/dev/sda2" from volume group "vg0"
    macaroni# pvremove /dev/sda2
      Labels on physical volume "/dev/sda2" successfully wiped
    macaroni#

    This leaves the following disks in use by LVM:

    macaroni# pvs
      PV         VG   Fmt  Attr PSize   PFree
      /dev/sdb2  vg0  lvm2 a--  931.51g     0
      /dev/sdc2  vg0  lvm2 a--  931.51g 18.16g
      /dev/sdd2  vg0  lvm2 a--  931.51g 18.16g
    macaroni#
  10. Then the system was powered down, the disk swapped, and the system rebooted. Note that no interaction with GRUB was required in order to boot normally. So now we do pretty much the reverse of the above in order to create the necessary volumes on the new disk.
  11. First, we create the same partition table using parted:
    macaroni# parted /dev/sda
    (parted) mklabel gpt
    (parted) mkpart
    Partition name?  []?                                                      
    File system type?  [ext2]?                                                
    Start? 1048576B                                                           
    End? 2097151B                                                             
    (parted) print                                                             
    ...
    Partition Table: gpt
    
    Number  Start     End       Size      File system  Name  Flags
     1      1048576B  2097151B  1048576B
    
    (parted) set 1 bios_grub on
    (parted) mkpart                                                          
    Partition name?  []?                                                      
    File system type?  [ext2]?                                                
    Start? 2097152B                                                           
    End? -1s
    Warning: You requested a partition from 2097152B to 1000203803648B.       
    The closest location we can manage is 2097152B to 1000203786752B.
    Is this still acceptable to you?
    Yes/No? yes                                                               
    (parted) print                                                          
    ...
    Partition Table: gpt
    
    Number  Start     End             Size            File system  Name  Flags
     1      1048576B  2097151B        1048576B                           bios_grub
     2      2097152B  1000203787263B  1000201690112B
    
    (parted) set 2 lvm on
    (parted) print                                                          
    ...
    Partition Table: gpt
    
    Number  Start     End             Size            File system  Name  Flags
     1      1048576B  2097151B        1048576B                           bios_grub
     2      2097152B  1000203787263B  1000201690112B                     lvm
    
    (parted) quit                                                               
    Information: You may need to update /etc/fstab.                           
    
    macaroni#
  12. Next we give the disk to LVM and specifically to vg0:
    macaroni# pvcreate /dev/sda2
      Writing physical volume data to disk "/dev/sda2"
      Physical volume "/dev/sda2" successfully created
    macaroni#
    
    macaroni# vgextend vg0 /dev/sda2
      Volume group "vg0" successfully extended
    macaroni#
  13. Then we re-create the mirrors of swap and root and create the data volume:
    macaroni# lvconvert --mirrors 1 --mirrorlog core /dev/vg0/swap /dev/sda2
      vg0/swap: Converted: 2.9%
      ...
      vg0/swap: Converted: 100.0%
    macaroni# lvconvert --mirrors 1 --mirrorlog core /dev/vg0/root /dev/sda2
      vg0/root: Converted: 0.7%
      ...
      vg0/root: Converted: 100.0%
    macaroni# lvcreate --name=data0 --extents=233818 /dev/vg0 /dev/sda2
      Logical volume "data0" created
    macaroni#
  14. Next, we add the appropriate LV to the MD:
    macaroni# mdadm --manage /dev/md0 --add /dev/vg0/data0
    mdadm: added /dev/vg0/data0
    macaroni# cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md0 : active raid5 dm-9[5] dm-2[1] dm-4[4] dm-3[2]
          2872760832 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
          [>....................]  recovery =  0.0% (339252/957586944) finish=188.1min speed=84813K/sec
    
    unused devices: <none>
    macaroni#
  15. GRUB’s device map is now probably wrong (the UUID of the removed disk will not be the same as that of the inserted one), so discard the old device map:
    macaroni# rm /boot/grub/device.map
    macaroni#

    Installing GRUB as the bootloader without first discarding GRUB’s old device map results in the following error messages:

    macaroni# /var/lib/dpkg/info/grub-pc.postinst configure
    /usr/sbin/grub-probe: error: Couldn't find PV pv3. Check your device.map.
    Installation finished. No error reported.
    /usr/sbin/grub-probe: error: Couldn't find PV pv3. Check your device.map.
    Installation finished. No error reported.
    /usr/sbin/grub-probe: error: Couldn't find PV pv3. Check your device.map.
    Installation finished. No error reported.
    Generating grub.cfg ...
    /usr/sbin/grub-probe: error: Couldn't find PV pv3. Check your device.map.
    Found linux image: /boot/vmlinuz-3.2.0-4-amd64
    Found initrd image: /boot/initrd.img-3.2.0-4-amd64
    /usr/sbin/grub-probe: error: Couldn't find PV pv3. Check your device.map.
    /usr/sbin/grub-probe: error: Couldn't find PV pv3. Check your device.map.
    /usr/sbin/grub-probe: error: Couldn't find PV pv3. Check your device.map.
    Found memtest86+ image: /boot/memtest86+.bin
    Found memtest86+ multiboot image: /boot/memtest86+_multiboot.bin
    done
    macaroni#
  16. Finally, install GRUB as the bootloader:
    macaroni# /var/lib/dpkg/info/grub-pc.postinst configure
    Installation finished. No error reported.
    Installation finished. No error reported.
    Installation finished. No error reported.
    Generating grub.cfg ...
    Found linux image: /boot/vmlinuz-3.2.0-4-amd64
    Found initrd image: /boot/initrd.img-3.2.0-4-amd64
    Found memtest86+ image: /boot/memtest86+.bin
    Found memtest86+ multiboot image: /boot/memtest86+_multiboot.bin
    done
    macaroni#
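
The degraded state visible in /proc/mdstat in steps 6 and 14 (the underscore in [_UUU]) can also be checked by a script rather than by eye. The following is only a minimal sketch: the function name degraded_arrays is made up for this page, and it assumes the usual /proc/mdstat layout in which the [UU_U]-style field appears on a line after the "mdN :" line.

```shell
#!/bin/sh
# Print the name of every MD array whose member-status field (for example
# [_UUU]) contains an underscore, i.e. has a missing or failed member.
# Reads /proc/mdstat unless another file is given as the first argument,
# so it can be tried against a saved copy.
# NOTE: degraded_arrays is a hypothetical helper, not part of mdadm.
degraded_arrays() {
    awk '
        # Remember which array the following lines belong to.
        /^md[0-9]+ :/ { array = $1; next }

        # First line after that which carries a [U_...] status field.
        array != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_") > 0)
                print array
            array = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Run without arguments on macaroni, this reads the live /proc/mdstat; against the output shown in step 6 it should print md0, and it should print nothing once the recovery started in step 14 has finished.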

Procedure: rebooting/powering off fiori and/or torchio

Beware that if rebooting (not powering off) both systems (not just one), then both systems need to be rebooted within about 60 seconds of each other!

  1. Press CTRL-ALT-F1 to get to the text console.
  2. Log in to the system as root.
  3. If shutting down the “other” system then run:
    ssh <name-of-other-system>
    poweroff           #  do this if you want to power off the system
    reboot             #  do this if you want to reboot the system

    After a few seconds you will be disconnected from the “other” system.

  4. If shutting down “this” system then run:
    poweroff           #  do this if you want to power off the system
    reboot             #  do this if you want to reboot the system
  5. Wait two minutes.
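
When both systems are to be rebooted, the 60-second window mentioned above is easier to hit if the two reboot commands are issued back to back. The following is only a sketch under assumptions: the helper names run and reboot_both are invented for this page, root can ssh to the other system as in step 3, and the DRY_RUN variable is purely an illustration aid for printing the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: reboot the "other" system over ssh, then reboot this one
# straight away so that both go down well inside the 60-second window.
# Set DRY_RUN=1 to print the commands instead of running them.
# NOTE: run and reboot_both are hypothetical helpers, not existing tools.

run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reboot_both() {
    other=$1    # name of the other system, e.g. torchio when run on fiori
    # The ssh session is usually cut off as the remote system goes down,
    # so its exit status is ignored rather than treated as a failure.
    run ssh "$other" reboot || true
    run reboot
}
```

Trying it first with DRY_RUN=1 set, for example by running DRY_RUN=1 and then reboot_both torchio in the same shell on fiori, prints the two commands without rebooting anything.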

See also