Moving the root filesystem to ZFS

Introduction

This procedure explains how to move the root filesystem to ZFS on Debian 12. Hopefully it gives enough clues to help with other Linux distributions too.

Procedure

All commands are to be run as root unless otherwise noted.

  1. If not already done, install the OS from the standard install media, setting up the partitioning as follows:
    1. create a partition of size 550MB for EFI (if your system does not support UEFI then skip this step).
    2. create a partition of size 1024MB to be formatted with ext4 and to be mounted as /boot.
    3. create a partition of size <the-bigger-of-4GB-and-desired-swap-space> to be formatted with ext4 and to be mounted as /.
    4. create a partition using all remaining space and mark it as not to be used.
    5. If there is a second disk then create the same partitions as above on it but mark them all as not to be used. (For the first three partitions this is for symmetry with the first disk.)

    (There is intentionally no swap space.)

  2. Use fdisk to set the partition type of the big partition on each disk to Solaris root (156), which is what ZFS uses.
  3. Edit /etc/apt/sources.list to contain:
    deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
    deb http://security.debian.org/debian-security bookworm-security main contrib non-free non-free-firmware
  4. Update the cache of available packages and install ZFS by running:
    apt-get update
    apt -y install zfsutils-linux
    

    (This will be slow as kernel modules need to be compiled.)

  5. Installing the package probably did not load the zfs kernel module, so load it:
    modprobe zfs
  6. Create a RAID0 zpool using the big partitions on each disk:
    zpool create zpool0 /dev/vda4 /dev/vdb4
    

    (I’m doing this on a VM so my devices are on vda and vdb. To create other types of RAID refer to the Debian wiki.)

  7. Although not directly related to moving the root filesystem to ZFS: we have just created a zpool and there are cron jobs that scrub it regularly, but nothing yet reports scrub failures. Therefore run:
    printf '#!/bin/sh\nzpool status -x | grep -Fvx "all pools are healthy" || true\n' > /etc/cron.hourly/zfs-report-scrub-errors
    chmod 755 /etc/cron.hourly/zfs-report-scrub-errors

    (It would be nicer to append something to the command initiating asynchronous scrubbing in /etc/cron.d/zfsutils-linux to the effect of:

    sleep-until-scrubbing-finished
    zpool status -x | grep -Fvx "all pools are healthy"

    but the above will do for the moment.)

  8. Create a ZFS dataset (i.e. a ZFS filesystem) for the root filesystem by running:
    zfs create -o quota=15g -o mountpoint=legacy zpool0/root
    

    (That sets the filesystem quota to 15GB, which df will show as the filesystem size. According to this post, letting ZFS manage the mountpoint for you isn’t really an option, hence mountpoint=legacy.)

  9. Install some commands we’ll use shortly:
    apt -y install zstd wget rsync
  10. Having the root filesystem on ZFS requires that the initrd image has support for ZFS. Add it by running:
    apt -y install zfs-initramfs
    
  11. To work around a bug in zfs-initramfs:
    1. Run the following:
      cd /usr/share/initramfs-tools/scripts
      cp zfs ~/zfs.orig-delete-me-soon
      wget -qO - https://raw.githubusercontent.com/openzfs/zfs/master/contrib/initramfs/scripts/zfs > zfs

      (Just in case that file disappears or gets edited to the extent that it no longer works, here’s a copy of it as it was on 22 March 2023.)

    2. Then regenerate the initrd image by running:
      mkinitramfs -o /boot/initrd.img-$(uname -r)
  12. Mount the ZFS dataset and copy the root filesystem into it:
    mount -t zfs zpool0/root /mnt
    rsync -ax / /mnt/
    umount /mnt
  13. Shortly we will regenerate the grub configuration file so that it boots from the new ZFS root filesystem, but that regeneration assumes that whatever is currently the root filesystem (according to /proc/cmdline) is what we want the root filesystem to be. So the first step is to boot, temporarily, from the right filesystem. Do this as follows:
    1. Reboot and at grub’s boot menu press ‘e’ to interrupt the boot selection and to edit the grub code.
    2. Change this:
      linux ... root=UUID=<some-uuid> ...

      to this:

      linux ... root=ZFS=zpool0/root ...
    3. Press CTRL-X to boot.
    4. Log in as root.
    5. Run df to verify that the root filesystem is now the one on ZFS.
  14. Update the grub configuration file so that the system permanently boots from the ZFS root filesystem:
    update-grub
    

    and reboot to test, this time without interrupting the boot process.

  15. We don’t strictly need to make the same change in /etc/fstab, since the ZFS root filesystem is already correctly mounted, but it is sensible to keep this file aligned, so change the entry for / to refer to zpool0/root with filesystem type zfs.
  16. After changing /etc/fstab, systemd needs to be informed by running:
    systemctl daemon-reload
    

    Without this, mounting anything will result in the warning:

    mount: (hint) your fstab has been modified, but systemd still uses
    the old version; use 'systemctl daemon-reload' to reload.
  17. Next we need to repurpose the old root partition as swap:
    1. use fdisk to set partition #3 to type Linux swap (18)
    2. use mkswap to format it as swap
    3. add an entry to /etc/fstab
    4. use swapon -a to activate it.
  18. To work around a bug in initramfs-tools-core (or perhaps another bug in zfs-initramfs):
    1. Run:
      mkinitramfs -o /boot/initrd.img-$(uname -r)
      W: Couldn't identify type of root file system for fsck hook
      zstdcat /boot/initrd.img-$(uname -r) | cpio -itv | grep fsck

      and notice the warning (the W: line above is output from mkinitramfs, not a command) and that the listing does not include fsck.zfs.

    2. Edit /usr/share/initramfs-tools/hooks/fsck, locate the following code:
      if [ "${MNT_DIR}" = "/" ] || [ "${MNT_TYPE}" = "auto" ]; then
          MNT_FSNAME="$(resolve_device "${MNT_FSNAME}")"
          fstype() { "/usr/lib/klibc/bin/fstype" "$@"; }
          if ! get_fstype "${MNT_FSNAME}"; then
              echo "W: Couldn't identify type of $2 file system for fsck hook" >&2
          fi
          unset -f fstype
      else
          ...

      and replace the code in the ‘then’ block with just:

      if [ "${MNT_DIR}" = "/" ] || [ "${MNT_TYPE}" = "auto" ]; then
          mount | sed -rn "s@.* on $MNT_DIR type ([^ ]+) \\(.*@\\1@p"
      else
          ...
    3. Rerun the commands in the first step and notice that it now does list fsck.zfs (and fsck).
  19. Finally reboot again, just to make sure that the system has been left reboot-safe.
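
Step 7 wishes for a way to wait until scrubbing has finished before checking the pool status. The following is a minimal sketch of that idea, assuming the installed OpenZFS is version 2.0 or later (which provides zpool wait). The function names and the POOL variable are hypothetical, and this is not what /etc/cron.d/zfsutils-linux does as shipped. The filter reads from standard input so that it can be tried without a real pool.

```shell
#!/bin/sh
# Sketch only. POOL is an assumed variable; zpool0 is the pool created in step 6.
POOL="${POOL:-zpool0}"

# Print every line of "zpool status -x" style input except the all-clear
# message. "|| true" keeps the exit status zero when nothing is printed.
filter_pool_status() {
    grep -Fvx "all pools are healthy" || true
}

# Start a scrub, block until it has finished, then print anything unhealthy.
# Assumes "zpool wait" is available (OpenZFS 2.0 or later).
scrub_and_report() {
    zpool scrub "$POOL" &&
    zpool wait -t scrub "$POOL" &&
    zpool status -x | filter_pool_status
}
```

Run from cron, scrub_and_report would print nothing for a healthy pool, so cron would only send mail when there is something to report.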

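The manual checks in steps 13, 17 and 19 could also be scripted. The sketch below is illustrative only: all function names are hypothetical, and the dataset name zpool0/root is the one used in this procedure. The parsing helpers read from standard input so that they can be tried on sample text; on a live system they are fed /proc/cmdline, /proc/mounts and /proc/swaps.

```shell
#!/bin/sh
# Hypothetical helpers for checking the result of the migration.

# Print the value of the root= parameter from a kernel command line on stdin.
cmdline_root() {
    tr ' ' '\n' | sed -n 's/^root=//p'
}

# Print the filesystem type of / from /proc/mounts style input on stdin.
# The last matching line wins, since later mounts shadow earlier ones.
root_fstype() {
    awk '$2 == "/" { t = $3 } END { if (t != "") print t }'
}

# Count swap areas in /proc/swaps style input (the first line is a header).
swap_count() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Run all checks against the live system; returns non-zero if any check fails.
check_zfs_root() {
    [ "$(cmdline_root < /proc/cmdline)" = "ZFS=zpool0/root" ] ||
        { echo "kernel was not booted with root=ZFS=zpool0/root" >&2; return 1; }
    [ "$(root_fstype < /proc/mounts)" = "zfs" ] ||
        { echo "root filesystem is not on ZFS" >&2; return 1; }
    [ "$(swap_count < /proc/swaps)" -gt 0 ] ||
        { echo "no swap space is active" >&2; return 1; }
    echo "root is on ZFS and swap is active"
}
```

After the final reboot, calling check_zfs_root should print the success message; otherwise the message on standard error names the check that did not pass.
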
See also