Investigating the suitability of LXC as a virtualisation platform

Introduction

I am currently using KVM as my virtualisation platform and I now have enough virtual machines (VMs) to generate more CPU fan noise than I want to listen to. So I wanted to investigate whether LXC would make a suitable lightweight replacement, and I also wanted to learn about LXC.

Beware that the scripts referenced here may have changed during or since this page was written! That means that calling them as described below may actually not work! That is the nature of a development log!

Hard requirement: isolation

The libvirt documentation about isolation explains that:

  • network: there is isolation only if the container has network interfaces
  • filesystems: there is isolation only if the container has filesystems
  • users and groups: there is isolation only if the container uses mapping

But there are additional things that need to be isolated:

  • IPC
  • used memory (stuff in host memory should not be accessible from the container)
  • free memory (the container should not see more memory than it is allowed to)
  • process trees
  • devices (the container should not be able to mknod stuff and if the host mknods stuff for it then those devices should still not be accessible)
  • this document provides some more ideas about what can be isolated

Hard requirement: libvirt-based management

Currently I use one DRBD-over-LV per VM, which obviates the normal cluster requirement for a shared filesystem while still allowing an even distribution of VMs across servers and providing redundancy for all hardware components.

This model exposes block devices, which provide a stable and clean way to interface to libvirt/KVM, although the devices are not under libvirt’s pool management facilities. This is a comfortable environment and one that hides the details of managing the underlying KVM virtualisation platform.
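
For concreteness, a per-VM DRBD-over-LV resource definition looks roughly like the sketch below; the resource name, DRBD minor, LV name and port are made up, and the two addresses simply reuse the replication IPs that appear later in this log.

resource drbd_somevm {                  # hypothetical resource name
    device    /dev/drbd10;              # hypothetical minor
    disk      /dev/vg0/somevm;          # the backing LV on each server
    meta-disk internal;
    on fiori {
        address 192.168.3.6:7710;       # which IP belongs to which host is a guess
    }
    on torchio {
        address 192.168.3.7:7710;
    }
}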

Since one possible outcome of this investigation is that I gradually migrate VMs from KVM to LXC within one cluster, being able to keep using libvirt is essential.

Hard requirement: DRBD-backed storage volumes

DRBD provides cheap redundancy with low management overhead. So I want to stick with it.

Soft requirement: use of libvirt for storage management

As just mentioned, I currently cannot use libvirt’s pool management, but it would be nice if I could.

Soft requirement: OS-level containers

In order to provide a standard environment on all physical machines (PMs), VMs and containers, I want the OS to be my unit of deployment; I do not want the application  – Docker-style – to be my unit of deployment.

Prologue

I did not want to contaminate my PM virtualisation servers (fiori and torchio) so I used a KVM VM (testaroli) to experiment on. I installed Debian 10 on this VM.

Installing LXC

  1. As per Debian’s LXC documentation, I ran:
    apt-get install -y lxc libvirt0 libpam-cgfs bridge-utils uidmap
  2. I added a new connection to virt-manager: remote host testaroli.pasta.net, type: LXC.
  3. To support using virsh I also ran in a terminal:
    # if working remotely
    export LIBVIRT_DEFAULT_URI=lxc+ssh://root@testaroli.pasta.net/
    # if working locally
    export LIBVIRT_DEFAULT_URI=lxc:///
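
A quick sanity check that the connection and the LXC driver work (just a sketch; the output will differ per installation):

virsh --connect lxc:/// version
virsh --connect lxc:/// list --all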

Exploring isolation

  1. I used virt-manager to create a new application container named vm1 running the default application (/bin/sh), with no network interfaces (if you have defined networks then you will actually have to remove an interface), the default 1GB RAM, no filesystems (but note that one filesystem mapping is automatically created when the container is started) and no user mapping (a command-line sketch of an equivalent definition follows at the end of this list):
  2. To test IPC isolation I ran this in the host:
    testaroli# perl -e 'use IPC::SysV qw(IPC_CREAT); $IPC_KEY = 1234; $id = semget($IPC_KEY, 10, 0666 | IPC_CREAT);'
    testaroli# ipcs -s
    
    ------ Semaphore Arrays --------
    key        semid      owner      perms      nsems     
    0x000004d2 360448     root       666        10        
    
    testaroli# 
    

    (that code is taken straight from the perlipc(1) man page) and in the container I ran:

    root@vm1:/# ipcs -s
    ------ Semaphore Arrays --------
    key semid owner perms nsems
    root@vm1:/#

    From that I concluded that there is IPC isolation by default.

  3. To test free memory isolation I ran the following in the host:
    testaroli# head -1 /proc/meminfo
    MemTotal: 4039828 kB
    testaroli#

    and then this in the container:

    /bin/sh: 0: can't access tty; job control turned off
    # exec bash
    root@vm1:/# head -1 /proc/meminfo
    MemTotal: 4039828 kB
    root@vm1:/#

    From this I concluded there was no free memory isolation! It looks like this is due to a bug. See Questions below.

  4. To test process isolation I ran the following in the host:
    testaroli# ps -ef | wc -l
    133
    testaroli# sleep 300 &
    [1] 3736
    testaroli#

    and then this in the container:

    /bin/sh: 0: can't access tty; job control turned off
    # exec bash
    root@vm1:/# ps -ef | wc -l
    4
    root@vm1:/# kill 3736
    sh: can't kill pid 3736: No such process
    root@vm1:/# ls /proc/3736
    ls: /proc/3736: No such file or directory
    root@vm1:/#

    From this I concluded there was process isolation.

  5. To test device isolation I ran the following in the host:
    testaroli# ls -lL /dev/vg0/root
    brw-rw---- 1 root disk 253, 0 May 7 14:51 /dev/vg0/root
    testaroli# mknod /tmp/vg0-root-host b 253 0
    testaroli# chmod 666 /tmp/vg0-root-host
    testaroli#

    and then this in the container:

    /bin/sh: 0: can't access tty; job control turned off
    # exec bash
    root@vm1:/# mknod /tmp/vg0-root-container b 253 0
    mknod: /tmp/vg0-root-container: Operation not permitted
    root@vm1:/# ls -ld /tmp/vg0-root-host 
    brw-r--r-- 1 root root 253, 0 May 10 16:28 /tmp/vg0-root-host
    root@vm1:/# dd if=/tmp/vg0-root-host of=/dev/null bs=1M count=1
    dd: failed to open '/tmp/vg0-root-host': Permission denied
    root@vm1:/# lspci
    00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
    00:01.0 VGA compatible controller: Red Hat, Inc. QXL paravirtual graphic card (rev 04)
    00:02.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
    ...
    root@vm1:/#

    and then again on the host:

    testaroli# rm /tmp/vg0-root-host
    testaroli#

    From this I concluded that there is limited device isolation; devices appear to be visible but not readable or writable.

  6. To test network isolation, I first ran this in the container to demonstrate that by default there is no isolation:
    root@vm1:/dev# echo 'nameserver 8.8.8.8' > /etc/resolv.conf
    root@vm1:/dev# ping -c 3 www.de
    PING www.de (103.224.182.245) 56(84) bytes of data.
    64 bytes from lb-182-245.above.com (103.224.182.245): icmp_seq=1 ttl=43 time=182 ms
    64 bytes from lb-182-245.above.com (103.224.182.245): icmp_seq=2 ttl=43 time=178 ms
    64 bytes from lb-182-245.above.com (103.224.182.245): icmp_seq=3 ttl=43 time=175 ms
    ^C
    --- www.de ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 3ms
    rtt min/avg/max/mdev = 175.355/178.502/181.697/2.634 ms
    root@vm1:/dev# 
    

    and then used virt-manager on fiori to add a non-NAT-ed network to testaroli and added an interface to this network to the container:

    # exec bash
    root@vm1:/# ifconfig eth0 192.168.100.131 netmask 255.255.255.0 up
    root@vm1:/# nmap -sP 192.168.100.0/24
    Nmap scan report for 192.168.100.1
    Host is up (0.000046s latency).
    MAC Address: 52:54:00:56:EB:8C (QEMU virtual NIC)
    Nmap scan report for vm1.dummynet (192.168.100.131)
    Host is up.
    Nmap done: 256 IP addresses (2 hosts up) scanned in 16.77 seconds
    root@vm1:/# route add -net default gw 192.168.100.1
    root@vm1:/# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
    ^C
    root@vm1:/#

    Then I shutdown the container and removed its network interface.
    From this I concluded that adding network interfaces did provide network isolation.

  7. From this point on the remaining checks were best done with a filesystem dedicated to the container.
  8. On the host I created a container-specific subdirectory and added that as the root filesystem:
    testaroli# mkdir /var/lib/libvirt/images/vm1
    testaroli#


    but  – to be expected really – that fails to start because /bin/sh does not exist within it:

    libvirt.libvirtError: internal error: guest failed to start: Failure in libvirt_lxc startup: cannot find init path '/bin/sh' relative to container root: No such file or directory

    and after putting a (static) shell in place:

    testaroli# mkdir /var/lib/libvirt/images/vm1/bin
    testaroli# cp /usr/bin/busybox /var/lib/libvirt/images/vm1/bin/sh 
    testaroli#

    and then on the container:

    / # find / -xdev | wc -l
    7
    / #

    From this I concluded that adding filesystems did provide filesystem isolation.

  9. To test used memory isolation I ran the following in the host
    testaroli# mkdir /var/lib/libvirt/images/vm1/dev2
    testaroli# mknod /var/lib/libvirt/images/vm1/dev2/mem c 1 1
    testaroli#

    and then ran this in the host:

    for I in {1..10}; do
        dd if=/dev/mem bs=1M count=1 2>/dev/null | md5sum
        sleep 1
    done

    and this in the container:

    for I in 1 2 3 4 5 6 7 8 9 10; do
        dd if=/dev2/mem bs=1M count=1 2>/dev/null | md5sum
        sleep 1
    done

    (Note avoidance of {X..Y}, which is not supported by BusyBox’s shell, and use of /dev2, as the container’s /dev appears not to be accessible from the host.)

    From the differing checksums I concluded that there is used memory isolation.

  10. To test UID/GID isolation I used virt-manager to enable user namespace:
    (In reviewing this page, I believe that screenshot above is wrong: it shows a NIC but the error below only appears when there is no NIC.)
  11. However, attempting to start the container resulted in this:

    This problem is described in this thread, which eventually includes the following from Richard Weinberger and Daniel Berrange:

    > > Fact is that commit a86b6215a74 is broken.
    > > We could also refuse to create container with userns
    > > enabled but netns disabled…
    > …
    Yes, if we are unable to figure out how to make this work, then we should report VIR_ERR_CONFIG_UNSUPPORTED for the combination of private userns + shared netns

    So I re-added the network interface, after which the container would start without problems.

  12. I tested UID/GID isolation by running this in the host:
    testaroli# mkdir /var/lib/libvirt/images/vm1/tmp
    testaroli# chmod 1777 /var/lib/libvirt/images/vm1/tmp
    testaroli# touch /var/lib/libvirt/images/vm1/tmp/y
    testaroli#

    and this in the container:

    / # touch /tmp/x
    / # ls -l /tmp/
    total 0
    -rw-r--r-- 1 0 0 0 May 19 12:47 x
    -rw-r--r-- 1 65534 65534 0 May 19 11:46 y
    / #

    and then this in the host:

    testaroli# ls -l /var/lib/libvirt/images/vm1/tmp
    total 0
    -rw-r--r-- 1 1000 1000 0 May 19 14:47 x
    -rw-r--r-- 1 root root 0 May 19 13:46 y
    testaroli#

    From this I concluded that enabling user namespaces does add isolation but see Questions below.
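
As mentioned in step 1, roughly the same container can also be defined from the command line with virt-install; this is only a sketch, assuming virt-install’s defaults correspond to what virt-manager produced above:

# create an application container running /bin/sh with 1GB RAM and no NIC
virt-install --connect lxc:/// \
    --name vm1 \
    --memory 1024 --vcpus 1 \
    --init /bin/sh \
    --network none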

Storing container filesystems in images in a type ‘fs’ pool

Due to the userns-enabled-and-netns-disabled issue described earlier, these experiments were done with a userns-enabled-and-netns-enabled container.

  1. On fiori/torchio I allocated a 100GB DRBD volume and gave it to testaroli as vdb.
  2. On testaroli I ran:
    testaroli# pvcreate /dev/vdb
    testaroli# vgcreate vg1 /dev/vdb
    testaroli# lvcreate --extents=100%FREE --name=containers vg1
    testaroli# mkfs -t ext4 /dev/vg1/containers
    testaroli#
  3. I left it unmounted and used virt-manager to create a storage pool of type ‘fs’ (preformatted block device) out of it; this way libvirt could manage the mounting/unmounting.
  4. But the pool can only be mounted on one of my virtualisation servers so that would mean all VMs (in the pool) would have to be run on the same virtualisation server, which is not what I want.
  5. Alternatively, I could have one pool per virtualisation server but this is also not flexible regarding which VMs could be run where.
  6. Option rejected without further testing. Cleaned up.

Storing container filesystems in a ‘file’ filesystem

Due to the userns-enabled-and-netns-disabled issue described earlier, these experiments were done with a userns-enabled-and-netns-enabled container.

I wanted to use a ‘block’ type filesystem for a single container. This would have required creating a DRBD-over-LV device again, not to act as a pool but for a single container. However, as a first step I thought of testing a ‘file’ filesystem. If this worked then hopefully a ‘block’ type volume would too.
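
For reference, if I read the libvirt domain XML documentation correctly, a ‘file’ filesystem for an LXC guest is expressed roughly like this (the path matches the image created in step 1 below; whether virt-manager generates exactly this is an assumption):

<filesystem type='file' accessmode='passthrough'>
  <driver type='loop' format='raw'/>
  <source file='/var/lib/libvirt/images/vm1.img'/>
  <target dir='/'/>
</filesystem>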

  1. On testaroli I ran:
    testaroli# dd if=/dev/zero of=/var/lib/libvirt/images/vm1.img bs=1M count=1000
    testaroli# mkfs -t ext4 /var/lib/libvirt/images/vm1.img
    testaroli# losetup /dev/loop0 /var/lib/libvirt/images/vm1.img
    testaroli# mount /dev/loop0 /mnt
    testaroli# mkdir /mnt/bin
    testaroli# cp /bin/busybox /mnt/bin/sh
    testaroli# umount /mnt
    testaroli# losetup -d /dev/loop0
    testaroli#
  2. I added a filesystem to the container:
  3. I tried to start the container but got:
  4. I tried adding /dev/loop0 directly as a block device:
  5. I tried to start the container but got:
  6. This looks similar to this bug, although the link at the bottom of the page seems less relevant.
  7. If I disable user-namespace (under Overview) and revert to using the image file rather than the loop device, then it works, but I really do want to use unprivileged containers.
  8. Option rejected. Cleaned up. Added todo-list item to try the same thing on a Debian testing KVM VM.

Storing container filesystems in a pool-less LV

Due to the userns-enabled-and-netns-disabled issue described earlier, these experiments were done with a userns-enabled-and-netns-enabled container.

  1. Add a 30GB ‘containers’ device to testaroli by running this:
    lagane# ~/opt/roottools/bin/pasta-virsh-vol-create-as testaroli_vm1 30 fiori
    lagane#
  2. Attach it to testaroli by running:
    fiori# virsh attach-disk testaroli /dev/drbd_testaroli_vm1 vdb
    fiori# 
    
  3. On testaroli allocate it to a new VG in LVM:
    testaroli# pvcreate /dev/vdb
    testaroli# vgcreate vg_vm1 /dev/vdb
    testaroli#
  4. Create a pool out of the VG (don’t try this with virt-manager; it gets very confused):
    testaroli# virsh pool-define-as vm1 --type=logical --target=/dev/vm1
    testaroli# virsh pool-start vm1
    testaroli#
  5. Use virt-manager to create a 1GB volume in the vm1 storage pool.
  6. Allocate this volume as a filesystem of type Block and driver Loop and target path /.
  7. Make a filesystem on the device, mount it on /mnt, create /mnt/bin, copy busybox to /mnt/bin/sh, unmount it.
  8. If user namespace is enabled then startup will fail (even with a network device) but if it is not enabled then startup will succeed. But that is a privileged container again!
  9. Rejected.
  10. Clean up all the way to removing the pool but not the LV.
  11. Make sure the LV is active:
    testaroli# lvs
    ...
    vm1_root vm1 -wi------- 1.00g 
    testaroli# lvchange -a y /dev/vm1/vm1_root
    testaroli#
  12. Mount the volume:
    testaroli# mount /dev/vm1/vm1_root /var/lib/libvirt/images/vm1
    testaroli#
  13. Add a filesystem to the VM of type Mount, driver Default.
  14. Start it with user namespace enabled.
  15. It works!
  16. On the host run:
    testaroli# cd /var/lib/libvirt/images/vm1
    testaroli# chown 1000:1000 . * bin/*
    testaroli#

    (1000 is the beginning of the user namespace.)

  17. Then in the container this can be done:
    / # ls -al
    total 20
    drwxr-xr-x 9       0     0  4096 May 23 17:30 .
    drwxr-xr-x 9       0     0  4096 May 23 17:30 ..
    drwxr-xr-x 2   65534 65534  4096 May 23 17:16 bin
    drwxr-xr-x 3       0     0   320 May 23 17:28 dev
    drwx------ 2   65534 65534 16384 May 23 17:16 lost+found
    dr-xr-xr-x 165 65534 65534     0 May 23 17:28 proc
    dr-xr-xr-x 13  65534 65534     0 May 23 17:28 sys
    / # mkdir /tmp
    / # chmod 1777 /tmp
    / #

Switch to Debian testing

The checks above need to be repeated on Debian 11, which, at the time of writing, was not released yet. However, Debian 11 RC2 was out and could be used to test.

Unfortunately, I got a bit diverted porting PCMS to Debian 11. Furthermore, that necessitated some changes to my Subversion repositories to support password-less svn-updates.

Finally, I got unblocked, cloned all the above content below … and continued!

Prologue

I did not want to contaminate my PM virtualisation servers (fiori and torchio) so I used a KVM VM (testaroli) to experiment on. I installed Debian 11 RC2 on this VM.

Installing LXC

  1. As per Debian’s LXC documentation, I ran:
    apt-get install -y lxc libvirt0 libpam-cgfs bridge-utils uidmap libvirt-clients \
                       netcat-openbsd libvirt-daemon libvirt-daemon-system
  2. In order to test Debian 11’s LXC using Debian 11’s virt-manager I ran:
    apt-get -y install virt-manager

    (Any interaction, screenshots, etc with virt-manager now refers to Debian 11’s virt-manager.)

  3. I added a new connection to virt-manager: local, type: LXC.
  4. To support using virsh I also ran in a terminal:
    # if working remotely
    export LIBVIRT_DEFAULT_URI=lxc+ssh://root@testaroli.pasta.net/
    # if working locally
    export LIBVIRT_DEFAULT_URI=lxc:///

Exploring isolation

  1. I used virt-manager to create a new application container named vm2 running the default application (/bin/sh), with no network interfaces (if you have defined networks then you will actually have to remove an interface), the default 1GB RAM, no filesystems (but note that one filesystem mapping is automatically created when the container is started) and no user mapping:
  2. In the VM I ran:
    /bin/sh: 0: can't access tty; job control turned off
    # exec bash
    root@vm2:/#
  3. To test IPC isolation I ran this in the host:
    testaroli# perl -e 'use IPC::SysV qw(IPC_CREAT); $IPC_KEY = 1234; $id = semget($IPC_KEY, 10, 0666 | IPC_CREAT);'
    testaroli# ipcs -s
    
    ------ Semaphore Arrays --------
    key        semid      owner      perms      nsems     
    0x000004d2 1          root       666        10        
    
    testaroli# 
    

    (that code is taken straight from the perlipc(1) man page) and in the container I ran:

    # ipcs -s
    
    ------ Semaphore Arrays --------
    key        semid      owner      perms      nsems     
    
    # 
    

    From that I concluded that there is IPC isolation by default.

  4. To test free memory isolation I ran the following in the host:
    testaroli# head -1 /proc/meminfo
    MemTotal: 4024000 kB
    testaroli#

    and then this in the container:

    root@vm2:/# head -1 /proc/meminfo 
    MemTotal:        1048576 kB
    root@vm2:/#

    From this I concluded there is memory isolation by default, i.e. this bug has been fixed in Debian 11 RC2.

  5. To test process isolation I ran the following in the host:
    testaroli# ps -ef | wc -l
    125
    testaroli# sleep 300 &
    [1] 8585
    testaroli# 
    

    and then this in the container:

    root@vm2:/# ps -ef | wc -l
    4
    root@vm2:/# kill 8585
    bash: kill: (8585) - No such process
    root@vm2:/# ls /proc/8585 
    ls: cannot access '/proc/8585': No such file or directory
    root@vm2:/#

    From this I concluded there was process isolation.

  6. To test device isolation I ran the following in the host:
    testaroli# ls -lL /dev/vg0/root
    brw-rw---- 1 root disk 253, 1 Jul 15 11:07 /dev/vg0/root
    testaroli# mknod /tmp/vg0-root-host b 253 0
    testaroli# chmod 666 /tmp/vg0-root-host
    testaroli#

    and then this in the container:

    root@vm2:/# mknod /tmp/vg0-root-container b 253 0
    mknod: /tmp/vg0-root-container: Operation not permitted
    root@vm2:/# ls -ld /tmp/vg0-root-host 
    brw-rw-rw- 1 root root 253, 0 Jul 15 12:04 /tmp/vg0-root-host
    root@vm2:/# dd if=/tmp/vg0-root-host of=/dev/null bs=1M count=1
    dd: failed to open '/tmp/vg0-root-host': Permission denied
    root@vm2:/# lspci
    00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
    00:01.0 VGA compatible controller: Red Hat, Inc. QXL paravirtual graphic card (rev 04)
    00:02.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
    00:02.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
    ...
    root@vm2:/# 
    

    and then again on the host:

    testaroli# rm /tmp/vg0-root-host
    testaroli#

    From this I concluded that there is limited device isolation; devices appear to be visible but not readable or writable.

  7. To test network isolation, I first ran this in the container to demonstrate that by default there is no isolation:
    root@vm2:/# echo 'nameserver 8.8.8.8' > /etc/resolv.conf
    root@vm2:/# ping -c 3 www.de
    ping: socket: Address family not supported by protocol
    PING www.de (103.224.182.245) 56(84) bytes of data.
    64 bytes from lb-182-245.above.com (103.224.182.245): icmp_seq=1 ttl=45 time=173 ms
    64 bytes from lb-182-245.above.com (103.224.182.245): icmp_seq=2 ttl=45 time=174 ms
    64 bytes from lb-182-245.above.com (103.224.182.245): icmp_seq=3 ttl=45 time=174 ms
    
    --- www.de ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2003ms
    rtt min/avg/max/mdev = 173.399/173.768/174.396/0.445 ms
    root@vm2:/# 
    

    and then used virt-manager on fiori to add a non-NAT-ed network to testaroli and added an interface to this network to the container:

    # exec bash
    root@vm2:/# ifconfig eth0 192.168.100.131 netmask 255.255.255.0 up
    root@vm2:/# nmap -sP 192.168.100.0/24
    Starting Nmap 7.80 ( https://nmap.org ) at 2021-07-15 13:02 CEST
    Nmap scan report for 192.168.100.1
    Host is up (0.000057s latency).
    MAC Address: 52:54:00:35:65:F7 (QEMU virtual NIC)
    Nmap scan report for 192.168.100.131
    Host is up.
    Nmap done: 256 IP addresses (2 hosts up) scanned in 28.04 seconds
    root@vm2:/# route add -net default gw 192.168.100.1
    root@vm2:/# ping 8.8.8.8
    ping: socket: Address family not supported by protocol
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
    From 192.168.100.1 icmp_seq=1 Destination Port Unreachable
    From 192.168.100.1 icmp_seq=2 Destination Port Unreachable
    From 192.168.100.1 icmp_seq=3 Destination Port Unreachable
    From 192.168.100.1 icmp_seq=4 Destination Port Unreachable
    ^C
    --- 8.8.8.8 ping statistics ---
    4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3058ms
    
    root@vm2:/# 
    

    Then I shutdown the container and removed its network interface.
    From this I concluded that adding network interfaces did provide network isolation.

  8. Note that above, in the ping outputs, both the VM and the host said:
    ping: socket: Address family not supported by protocol

    This seems to be due to a recently-introduced bug in ping. Using option -4 will avoid the warning.

  9. On the host I created a container-specific subdirectory and added that as the root filesystem:
    testaroli# mkdir /var/lib/libvirt/images/vm2
    testaroli#


    but  – to be expected really – that fails to start because /bin/sh does not exist within it:

    Error starting domain: internal error: guest failed to start: Failure in libvirt_lxc startup: cannot find init path '/bin/sh' relative to container root: No such file or directory

    and after putting a (static) shell in place:

    testaroli# mkdir /var/lib/libvirt/images/vm2/bin
    testaroli# cp /usr/bin/busybox /var/lib/libvirt/images/vm2/bin/sh 
    testaroli#

    and then on the container:

    BusyBox v1.30.1 (Debian 1:1.30.1-6+b2) built-in shell (ash)
    Enter 'help' for a list of built-in commands.
    
    /bin/sh: can't access tty; job control turned off
    / # find / -xdev | wc -l
    7
    / #

    From this I concluded that adding filesystems did provide filesystem isolation.

  10. (Test of the isolation of used memory removed; I believe this is a redundant test.)
  11. To test UID/GID isolation, I went back to the most basic container without network interfaces; then I started the container and inside it ran:
    root@vm2:/# touch /tmp/xxx
    root@vm2:/# ls -ld /tmp/xxx
    -rw-r--r-- 1 root root 0 Jul 15 15:21 /tmp/xxx
    root@vm2:/# 
    

    and compared the ownership of that new file as seen from the host:

    testaroli# ls -ld /tmp/xxx
    -rw-r--r-- 1 root root 0 Jul 15 15:21 /tmp/xxx
    testaroli#
  12. So by default there is no user namespace isolation. Stop the container.
  13. But there is a problem: compare the screenshot above with the one from Debian 10: the one from Debian 10 has widgets for configuring user namespaces, but the one from Debian 11 does not. I suspect this is due to these changes. However, the same can be done with virt-xml.
  14. Run:
    testaroli# virt-xml vm2 --edit --idmap uid_start=0,uid_target=1000,uid_count=10,gid_start=0,gid_target=1000,gid_count=10
    Domain 'vm2' defined successfully.
    testaroli#

    (Note that in Debian 10’s virt-manager it is possible to configure user namespaces; in Debian 11 RC2’s it was not possible because the relevant widgets were not present in the GUI; in Debian 11 RC2 plus a couple of weeks (virt-manager 1:3.2.0-3) it was again possible to configure user namespaces.)

  15. Run:
    testaroli# virsh start vm2
    error: Failed to start domain 'vm2'
    error: internal error: guest failed to start: Failure in libvirt_lxc startup: Failed to mount /sys on /sys type sysfs flags=0x100e: Invalid argument
    
    
    testaroli# 
    

    It could be that this is the userns/netns problem described in the Debian 10 section above. So I added a NIC and tried to start it again:

    testaroli# virsh start vm2
    Domain 'vm2' started
    
    testaroli#

    and repeating the touch+ls -ld check now in the container:

    root@vm2:/# touch /tmp/ttt
    root@vm2:/# ls -ld /tmp/ttt
    -rw-r--r-- 1 root root 0 Jul 15 15:38 /tmp/ttt
    root@vm2:/#

    and in the host:

    testaroli# ls -ld /tmp/ttt
    -rw-r--r-- 1 1000 1000 0 Jul 15 15:38 /tmp/ttt
    testaroli#

    Great!

  16. From this I concluded that enabling user namespaces does add isolation but see Questions below.

Storing container filesystems in images in a type ‘fs’ pool

Due to the userns-enabled-and-netns-disabled issue described earlier, these experiments were done with a userns-enabled-and-netns-enabled container.

Ultimately, I do not want to use this storage configuration, but I do want to test one step at a time!

  1. On fiori/torchio I allocated a 100GB DRBD volume and gave it to testaroli as vdb.
  2. On testaroli I ran:
    testaroli# pvcreate /dev/vdb
    testaroli# vgcreate vg1 /dev/vdb
    testaroli# lvcreate --extents=100%FREE --name=containers vg1
    testaroli# mkfs -t ext4 /dev/vg1/containers
    testaroli#
  3. I left it unmounted and used virt-manager to create a storage pool with name ‘pool1’, of type ‘fs’ (preformatted block device), out of it; this way libvirt could manage the mounting/unmounting. The mountpoint was to be /var/lib/libvirt/images/pool1.
  4. But the pool can only be mounted on one of my virtualisation servers so that would mean all VMs (in the pool) would have to be run on the same virtualisation server, which is not what I want.
  5. Alternatively, I could have one pool per virtualisation server but this is also not flexible regarding which VMs could be run where.
  6. Option rejected without further testing. Cleaned up.

Storing container filesystems in a ‘file’ filesystem

Due to the userns-enabled-and-netns-disabled issue described earlier, these experiments were done with a userns-enabled-and-netns-enabled container.

I wanted to use a ‘block’ type filesystem for a single container. This would have required creating a DRBD-over-LV device again, not to act as a pool but for a single container. However, as a first step I thought of testing a ‘file’ filesystem. If this worked then hopefully a ‘block’ type volume would too.

  1. On testaroli I ran:
    testaroli# dd if=/dev/zero of=/var/lib/libvirt/images/vm2.img bs=1M count=1000
    testaroli# mkfs -t ext4 /var/lib/libvirt/images/vm2.img
    testaroli# losetup /dev/loop0 /var/lib/libvirt/images/vm2.img
    testaroli# mount /dev/loop0 /mnt
    testaroli# mkdir /mnt/bin
    testaroli# cp /bin/busybox /mnt/bin/sh
    testaroli# umount /mnt
    testaroli# losetup -d /dev/loop0
    testaroli#
  2. I added a filesystem to the container:
  3. I tried to start the container but got:
    testaroli# virsh start vm2
    error: Failed to start domain 'vm2'
    error: internal error: guest failed to start: Failure in libvirt_lxc startup: Failed to create /run/libvirt/lxc/vm2.root: Permission denied
    
    
    testaroli#

    (Remember that user namespaces are still enabled.)

  4. I’m certain that disabling user namespaces would make this work, but user namespaces are critical for security. Option rejected. Cleaned up.

Other investigations

Other investigations show that NFS won’t work in unprivileged containers.

It is very well documented that privileged containers are not secure.

Conclusions #1

  • I need to rethink volume management
  • I could use it only for applications with no NFS requirements

Storage for OS containers

If, at the end of this investigation, I decide to use libvirt+LXC, then I would have a large amount of free space on fiori/torchio available to allocate to containers.

For testing on testaroli, I allocate a large DRBD-over-LV-based device on fiori, attach it to testaroli and prepare to allocate it to containers.

  1. On fiori/torchio I allocated a 100GB DRBD volume called drbd_testaroli_containers and gave it to testaroli as vdb.
  2. On testaroli I ran:
    testaroli# pvcreate /dev/vdb
    testaroli# vgcreate vg1 /dev/vdb
    testaroli#

Storage for a particular OS container

If, at the end of this investigation, I decide to use libvirt+LXC, then I would allocate an OS-sized LV on fiori and torchio, put a DRBD device on it and – most important! – mount it manually (as libvirt won’t be able to, as it will run an unprivileged container).

For testing on testaroli, I allocate an OS-sized LV, put a filesystem on it and – most important! – mount it manually (as libvirt won’t be able to, as it will run an unprivileged container).

  1. On testaroli I ran:
    lvcreate --size=15g --name=vm3_root vg1
    mkfs -t ext4 /dev/vg1/vm3_root
    

Pools?

At this point I wondered: though libvirt+LXC cannot mount volumes for an unprivileged container (because of the give-up-root-privileges-before-mount problem described above), it can probably start a pool, which is really not VM/container-specific but which I could make container- or container-fs-specific.

If I made per-container pools based on the OS-sized LV (or on DRBD-over-LV, if the outcome of the investigation is positive) then I could put disk images inside them (which would have to be manually mounted because …).

Pools test #1

  1. Rename the just-created LV to be container-specific rather than container-fs-specific:
    lvrename /dev/vg1/vm3_root /dev/vg1/vm3
  2. Put a filesystem on it:
    mkfs -t ext4 /dev/vg1/vm3
  3. Allocate that formatted volume to a pool:
    virsh pool-define-as --name=vm3 --type=fs --source-dev=/dev/vg1/vm3 --target=/var/lib/libvirt/images/vm3
    virsh pool-start vm3

    (Starting the pool also mounts the filesystem, as shown by a mount or df call.)

  4. Create a disk image inside the pool ready for a simple test application container:
    dd if=/dev/zero of=/var/lib/libvirt/images/vm3/root.img bs=1M count=10240
    mkfs -t ext4 /var/lib/libvirt/images/vm3/root.img
    virsh pool-refresh vm3
    mount /var/lib/libvirt/images/vm3/root.img /mnt
    mkdir /mnt/bin
    cp /bin/busybox /mnt/bin/sh
    chown -R 1000:1000 /mnt/
    umount /mnt
    
  5. Create a simple application container as a test, use the virt-xml trick to make it unprivileged (mapping root to 1000), attach a filesystem using loop device to access the just-created disk image, start the container. It fails, unable to mount the disk image.

Pools test #2

  1. If I do the mounts manually:
    mkdir /var/lib/libvirt/images/vm3/root
    mount /var/lib/libvirt/images/vm3/root.img /var/lib/libvirt/images/vm3/root

    (note that the loop device is selected automatically) and adjust the filesystem type in libvirt to ‘mount’, then it starts.

  2. If I add a second disk image for /home:
    dd if=/dev/zero of=/var/lib/libvirt/images/vm3/home.img bs=1M count=1024
    mkfs -t ext4 /var/lib/libvirt/images/vm3/home.img
    mount /var/lib/libvirt/images/vm3/home.img /mnt
    chown -R 1000:1000 /mnt/
    umount /mnt
    mkdir /var/lib/libvirt/images/vm3/home
    mount /var/lib/libvirt/images/vm3/home.img /var/lib/libvirt/images/vm3/home
    virsh pool-refresh vm3

    and attach the filesystem:

    virt-xml vm3 --add --filesystem type=mount,source=/var/lib/libvirt/images/vm3/home,target=/home

    (I think adding filesystems using virsh is not possible); this seems to work as desired:

    / # df -h
    Filesystem                Size      Used Available Use% Mounted on
    /dev/loop0                9.7G      2.0M      9.2G   0% /
    devfs                    64.0K         0     64.0K   0% /dev
    /dev/loop1              973.4M     24.0K    906.2M   0% /home
    / #
  3. I tried to run adduser (this is discussed in more detail below):
    # adduser test1
    Adding user `test1' ...
    Adding new group `test1' (1000) ...
    Adding new user `test1' (1000) with group `test1' ...
    Creating home directory `/home/test1' ...
    Stopped: chown 1000:1000 /home/test1: Invalid argument
  4. I shut down the container, erased the contents of root and home (including dot files, but not the directories themselves, as they are active mountpoints) and changed the owner/group back to root/root.

Back to user namespaces

Regarding the attempt above to run adduser, it looks like I need to extend the count parameter in the user namespace. Perhaps a good model would be:

start=0
target=100000 * ( <vm-index> + 1 )
count=65536

for both UID and GID mapping, where <vm-index> starts at 0. Values for count higher than 65536 are saved in the XML but shown as 65536 in virt-manager.
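
Under that model the first container (<vm-index> = 0) would get the following mapping, using the same virt-xml syntax as above (the domain name vm3 is just an example):

virt-xml vm3 --edit --idmap \
    uid_start=0,uid_target=100000,uid_count=65536,gid_start=0,gid_target=100000,gid_count=65536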

The container creation tools lxc-create and debootstrap create hierarchies with root@container-owned files belonging to root@host, which is suitable for a privileged container, but not for an unprivileged container, so we need a script to shift the UIDs and GIDs.

shift-uid-gid is a suitable script; be sure to read comments in it before you run it.
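
To illustrate the principle only: the core operation amounts to adding a fixed offset to every file’s owner and group, roughly as in the sketch below (this is not the real shift-uid-gid script, which needs libfile-lchown-perl and presumably does rather more):

#!/bin/bash
# Illustration only; NOT the real shift-uid-gid script.
# Usage: shift-uid-gid-sketch <root-dir> <uid-shift> <gid-shift>
root=$1 uid_shift=$2 gid_shift=$3
find "$root" -print0 |
while IFS= read -r -d '' path; do
    uid=$(stat -c %u "$path")
    gid=$(stat -c %g "$path")
    # -h changes a symlink itself rather than its target
    chown -h "$((uid + uid_shift)):$((gid + gid_shift))" "$path"
done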

Back to installing an OS in the container

  1. I used lxc-create to create the OS container with this command:
    lxc-create \
        --bdev dir --dir /var/lib/libvirt/images/vm3/root -n root \
        --mirror=http://deb.debian.org/debian \
        -t debian -- --release=testing
    rm -fr /var/lib/lxc/root /var/cache/lxc/debian/rootfs-testing-amd64

    (For clarity, I didn’t want lxc-create to cache the constructed root in /var/lib/lxc/<container-os>, but could not find how to disable it so just rm-ed it manually. Also, I didn’t want to use LXC directly, so removed the configuration that lxc-create had created.)

  2. I downloaded the shift-uid-gid script and used it to shift the owner/group of all files in the container:
    wget https://svn.pasta.freemyip.com/main/virttools/trunk/shift-uid-gid/bin/shift-uid-gid
    chmod 755 shift-uid-gid
    ./shift-uid-gid /var/lib/libvirt/images/vm3/root/ 100000 100000
  3. I then backed that container directory up elsewhere for later reference (see below):
    rsync -a /var/lib/libvirt/images/vm3/root/ /tmp/root.lxc-create/
  4. I created an OS container using this root:
    • user namespace (0,100000,65536,0,100000,65536)
    • NIC plumbed into br0
    • filesystem /var/lib/libvirt/images/vm3/root mapped to /
  5. There were a few errors during startup:
    Welcome to Debian GNU/Linux 11 (bullseye)!
    
    Set hostname to .
    Couldn't move remaining userspace processes, ignoring: Input/output error
    ...
    systemd-journald-audit.socket: Failed to create listening socket (audit 1): Operation not permitted
    systemd-journald-audit.socket: Failed to listen on sockets: Operation not permitted
    systemd-journald-audit.socket: Failed with result 'resources'.
    [FAILED] Failed to listen on Journal Audit Socket.
    See 'systemctl status systemd-journald-audit.socket' for details.
    ...
    systemd-journald.service: Attaching egress BPF program to cgroup /sys/fs/cgroup/machine.slice/machine-lxc\x2d270139\x2dvm3.scope/system.slice/systemd-journald.service failed: Invalid argument
             Starting Journal Service...
             Starting Remount Root and Kernel File Systems...
    [  OK  ] Mounted POSIX Message Queue File System.
    sys-kernel-debug.mount: Mount process exited, code=exited, status=32/n/a
    sys-kernel-debug.mount: Failed with result 'exit-code'.
    [FAILED] Failed to mount Kernel Debug File System.
    See 'systemctl status sys-kernel-debug.mount' for details.
    sys-kernel-tracing.mount: Mount process exited, code=exited, status=32/n/a
    sys-kernel-tracing.mount: Failed with result 'exit-code'.
    [FAILED] Failed to mount Kernel Trace File System.
    See 'systemctl status sys-kernel-tracing.mount' for details.
    [  OK  ] Finished Helper to synchronize boot up for ifupdown.
    sys-kernel-config.mount: Mount process exited, code=exited, status=32/n/a
    sys-kernel-config.mount: Failed with result 'exit-code'.
    [FAILED] Failed to mount Kernel Configuration File System.
    See 'systemctl status sys-kernel-config.mount' for details.
    ...
    Debian GNU/Linux 11 root console
    
    root login: 
    
  6. I then shutdown the container, deleted the real root’s contents, changed the ownership of the mountpoint back to root:root.
  7. Then I installed it again, but using debootstrap:
    debootstrap --verbose --variant=minbase --arch=amd64 \
        --include=init,ifupdown,locales,dialog,isc-dhcp-client,netbase,net-tools,iproute2 \
        --keyring=/usr/share/keyrings/debian-archive-keyring.gpg testing \
        /var/lib/libvirt/images/vm3/root http://deb.debian.org/debian
  8. I used the shift-uid-gid script to shift the owner/group of all files in the container:
    ./shift-uid-gid /var/lib/libvirt/images/vm3/root/ 100000 100000
  9. I then backed that container directory up elsewhere for later reference (see below):
    rsync -a /var/lib/libvirt/images/vm3/root/ /tmp/root.debootstrap/
  10. The errors during startup appeared to be the same as above.
  11. I used FAD to compare the backup of the lxc-create-created root and the debootstrap-created root; the main differences were:
    • lxc-create configured the NIC to use DHCP; debootstrap left the NIC unconfigured
    • lxc-create added security to /etc/apt/sources.list
    • lxc-create installed openssh-server (plus prerequisites), /etc/inittab, getty startup on several VTs, sensible-utils
  12. I recreated the root again using debootstrap with some slight modifications:
    debootstrap --verbose --variant=minbase --arch=amd64 \
        --include=init,ifupdown,locales,dialog,isc-dhcp-client,netbase,net-tools,iproute2,openssh-server,sensible-utils \
        --keyring=/usr/share/keyrings/debian-archive-keyring.gpg testing \
        /var/lib/libvirt/images/vm3/root http://deb.debian.org/debian
    echo -e 'auto lo\niface lo inet loopback\n\nauto eth0\niface eth0 inet dhcp' > /var/lib/libvirt/images/vm3/root/etc/network/interfaces
  13. Then, as before, I did the UID/GID shift, backed it up, started the container (the errors during startup appeared to be the same as above) and finally used FAD to compare the backup of the lxc-create-created root and this second debootstrap-created root; there were far fewer differences (and now I could see some that I must have missed the first time):
    • lxc-create configured /etc/hostname correctly; debootstrap reused the host’s hostname
    • lxc-create created /etc/inittab (spawning multiple getty processes)

Conclusions #2

  • I should use lxc-create to create an OS container, rather than using debootstrap,  because:
    1. it leaves me with fewer commands to run manually
    2. it offloads investigative work from me

Recap

Regarding this procedure, note:

  • the repeated mount/umount: this is done to (a) make the steps reentrant and (b) make clear that the final start/stop requires the mount/umount as well
  1. Set some variables:
    export LIBVIRT_DEFAULT_URI=lxc:///    #  make sure virsh understands to create LXC VMs not KVM VMs
    NAME=vm4
    UID_GID_SHIFT=200000
    ROOT_PASSWORD=xxxxxxxx                 #  set to desired root password for container
  2. Get some scripts we’ll need:
    mkdir -p ~/opt
    svn co https://svn.pasta.freemyip.com/main/virttools/trunk ~/opt/virttools
  3. Create the pool and start it:
    lvcreate --size=15g --name=$NAME vg1
    mkfs -t ext4 /dev/vg1/$NAME
    #  don't contaminate default pool with pools!
    mkdir -p /var/lib/libvirt/pools/$NAME
    virsh pool-define-as --name $NAME --type fs --source-dev /dev/vg1/$NAME --target /var/lib/libvirt/pools/$NAME
    virsh pool-start $NAME
    
  4. Create and populate the root volume (including setting root’s password):
    dd if=/dev/zero of=/var/lib/libvirt/pools/$NAME/root.img bs=1M count=5120
    mkfs -t ext4 /var/lib/libvirt/pools/$NAME/root.img
    mkdir /var/lib/libvirt/pools/$NAME/root
    mount /var/lib/libvirt/pools/$NAME/root.img /var/lib/libvirt/pools/$NAME/root
    lxc-create \
        --bdev=dir \
        --dir=/var/lib/libvirt/pools/$NAME/root \
        --name=$NAME \
        --template=debian \
        -- \
        --mirror=http://deb.debian.org/debian \
        --release=testing
    echo -e "$ROOT_PASSWORD\n$ROOT_PASSWORD" | \
        chroot /var/lib/libvirt/pools/$NAME/root passwd root
    umount /var/lib/libvirt/pools/$NAME/root
    rm -fr /var/lib/lxc/$NAME
    rm -fr /var/cache/lxc/debian/rootfs-testing-amd64
  5. Shift the owner/groups:
    mount /var/lib/libvirt/pools/$NAME/root.img /var/lib/libvirt/pools/$NAME/root
    ~/opt/virttools/shift-uid-gid/bin/shift-uid-gid /var/lib/libvirt/pools/$NAME/root $UID_GID_SHIFT $UID_GID_SHIFT
    umount /var/lib/libvirt/pools/$NAME/root
  6. Define the basic container:
    virsh define <(~/opt/virttools/create-basic-lxc-vm/bin/create-basic-lxc-vm $NAME $UID_GID_SHIFT)
  7. To add a /home volume run:
    dd if=/dev/zero of=/var/lib/libvirt/pools/$NAME/home.img bs=1M count=1024
    mkfs -t ext4 /var/lib/libvirt/pools/$NAME/home.img
    mkdir /var/lib/libvirt/pools/$NAME/home
    mount /var/lib/libvirt/pools/$NAME/home.img /var/lib/libvirt/pools/$NAME/home
    ~/opt/virttools/shift-uid-gid/bin/shift-uid-gid /var/lib/libvirt/pools/$NAME/home $UID_GID_SHIFT $UID_GID_SHIFT
    virt-xml $NAME --add --filesystem type=mount,source=/var/lib/libvirt/pools/$NAME/home,target=/home
    umount /var/lib/libvirt/pools/$NAME/home
    
  8. Start the container with:
    mount /var/lib/libvirt/pools/$NAME/root.img /var/lib/libvirt/pools/$NAME/root
    mount /var/lib/libvirt/pools/$NAME/home.img /var/lib/libvirt/pools/$NAME/home
    virsh start $NAME
  9. Log in as root on the console and do a simple network check with:
    timeout 10 cat > /dev/tcp/www.google.com/80
    exit

    (Not much is installed yet, not even ping, so a more sophisticated check will have to wait.)

  10. Shut the container down with:
    virsh shutdown $NAME
    umount /var/lib/libvirt/pools/$NAME/root
    umount /var/lib/libvirt/pools/$NAME/home
  11. Shut the pool down:
    virsh pool-destroy $NAME

Boot-time error messages

During container startup:

  • a kernel bug logged as a systemd bug  causes this:
    Couldn't move remaining userspace processes, ignoring: Input/output error
  • a systemd bug causes this and some related error messages:
    systemd-journald-audit.socket: Failed to listen on sockets: Operation not permitted
  • another systemd bug causes these:
    sys-kernel-debug.mount: Mount process exited, code=exited, status=32/n/a
    sys-kernel-debug.mount: Failed with result 'exit-code'.
    [FAILED] Failed to mount Kernel Debug File System.
    See 'systemctl status sys-kernel-debug.mount' for details.
    sys-kernel-tracing.mount: Mount process exited, code=exited, status=32/n/a
    sys-kernel-tracing.mount: Failed with result 'exit-code'.
    [FAILED] Failed to mount Kernel Trace File System.
    See 'systemctl status sys-kernel-tracing.mount' for details.
    sys-kernel-config.mount: Mount process exited, code=exited, status=32/n/a
    sys-kernel-config.mount: Failed with result 'exit-code'.
    [FAILED] Failed to mount Kernel Configuration File System.
    See 'systemctl status sys-kernel-config.mount' for details.

    but these can be addressed by running:

    systemctl mask sys-kernel-debug.mount
    systemctl mask sys-kernel-tracing.mount
    systemctl mask sys-kernel-config.mount

Hooks #1

At this point I discovered hooks and arbitrary per-app metadata, which might allow virsh start to mount the disk images inside the pool or even to revert to using block devices as volumes!

Let’s start simple! In the ‘Recap’ section above, the container is started with:

virsh pool-start $NAME
mount /var/lib/libvirt/pools/$NAME/root.img /var/lib/libvirt/pools/$NAME/root
mount /var/lib/libvirt/pools/$NAME/home.img /var/lib/libvirt/pools/$NAME/home
virsh start $NAME

and stopped with a symmetric stanza. I want:

  • the mounts and umounts to be done in a hook
  • maybe to use non-VM-specific pools (meaning that I decouple pool start/stop from container start/stop)

So I wrote an lxc hook called mounter that looks for mounter’s metadata in the VM’s XML and acts upon it.
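
The real hook is in virttools (see the steps below); its shape is roughly the sketch that follows. This is not the actual code: only the hook calling convention and the metadata layout, both shown further down, are taken as given.

#!/bin/sh
# Sketch only; not the real mounter hook.
# libvirt runs LXC hooks as:  <script> <guest> <operation> <sub-op> -
# and feeds the guest XML on stdin.
guest=$1
op=$2
xml=$(cat)

list_mounts() {
    printf '%s' "$xml" |
        xmlstarlet sel -t -m '//mounts/mount' \
            -v '@device' -o ' ' -v '@mntpnt' -n -
}

case $op in
    prepare)        # about to start: mount everything listed in the metadata
        list_mounts | while read -r device mntpnt; do
            # a 'block'/DRBD type could run "drbdadm primary ..." here first
            mount "$device" "$mntpnt"
        done
        ;;
    release)        # fully stopped: unmount everything again
        list_mounts | while read -r device mntpnt; do
            umount "$mntpnt"
        done
        ;;
esac
exit 0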

  1. I should have already downloaded virttools; if not then see above.
  2. Hardlink the mounter plugin:
    cd /etc/libvirt/hooks/lxc.d/
    ln ~/opt/virttools/libvirt-hooks/lxc.d/mounter

    Note that symlinking won’t work (error message normally appears all on one line):

    testaroli# virsh start $NAME
    error: Failed to start domain 'vm4'
    error: Hook script execution failed: internal error: 
    Child process (LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin /etc/libvirt/hooks/lxc.d/mounter vm4 prepare begin -)
    unexpected exit status 126: libvirt:  error : cannot execute binary /etc/libvirt/hooks/lxc.d/mounter: Permission denied
    testaroli#
  3. Edit the VM’s XML by running:
    virsh edit $NAME

    and add:

    <uuid>...</uuid>
    <metadata>
      <app1:mounter xmlns:app1="http://app1.org/app1/">
        <mounts>
          <mount type="loop" device="/var/lib/libvirt/pools/vm4/root.img" mntpnt="/var/lib/libvirt/pools/vm4/root"/>
          <mount type="loop" device="/var/lib/libvirt/pools/vm4/home.img" mntpnt="/var/lib/libvirt/pools/vm4/home"/>
        </mounts>
      </app1:mounter>
    </metadata>
    <memory unit='KiB'>...</memory>

    Note that I could not find a virt-xml command to insert that from the command line.

  4. The VM can now be started and stopped with only:
    testaroli# export LIBVIRT_DEFAULT_URI=lxc:///
    testaroli# NAME=vm4
    testaroli# virsh pool-start $NAME
    Pool vm4 started
    testaroli# virsh start $NAME
    Domain 'vm4' started
    testaroli# virsh shutdown $NAME
    Domain 'vm4' is being shutdown
    testaroli# virsh pool-destroy $NAME
    Pool vm4 destroyed
    testaroli#

    Woohoo!

Pools

Although the official documentation extols the benefits of storage pools and gives plenty of examples, it does not make explicit what a pool is and what a pool is not.

However, judging by examples, it looks like there are conventions:

  • multiple volumes may be carved out of a single pool and attached to different VMs
  • pools provide storage of a “class”, where the class relates to the nature/name/whatever of the underlying storage, not to the nature/name/whatever of the VMs that use storage from that pool

DRBD sounds like a storage class and therefore like a suitable candidate to make a storage pool out of.

However, if I allocated one DRBD device to a new storage pool and then put two VMs’ volumes in that pool, then migrating only one of the VMs:

  • would require OCFS2 (which I found added a substantial performance overhead and was very prone to lockups leading to split brain), or
  • would require another clustered filesystem (which all have more complicated resource managers), or
  • would require me to abuse storage pools by limiting their contents to the volumes of only one VM (which goes against the conventions), or
  • would require me not to use pools

The latter is what I currently do and is accepted in the official documentation:

Storage pools and volumes are not required for the proper operation of VMs … but some administrators will prefer to manage their own storage and VMs will operate properly without any pools … On systems that do not use pools, system administrators must ensure the availability of the VMs’ storage using whatever tools they prefer ….

So I will continue to not use pools.
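
For completeness, a pool-less volume in my current KVM setup is just a block-device disk in the domain XML, along these lines (a sketch; the DRBD device name is illustrative):

<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/drbd_somevm'/>
  <target dev='vda' bus='virtio'/>
</disk>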

Testing a DRBD-based container environment

The main aims of this section are:

  • the hook script needs to be extended to handle DRBD devices and/or generic block devices; the difference would be a call to drbdadm primary
  • I want to test my scripts for creating VMs, as I have moved a lot of scripts recently
  • I want to do DRBD-based tests

So here I go:

  1. I modified the hook script to change the ‘loop’ type to use parameters ‘file’ (instead of ‘device’) and ‘mntpnt’
  2. I modified vm4’s metadata accordingly and tested.
  3. I modified the hook script to add a ‘block’ type to use parameters ‘device’ and ‘mntpnt’.
  4. I configured pestaroli the same as testaroli in PCMS (to prepare for installing pestaroli)
  5. I released a new PCMS (to clear the badly formatted messages coming from other hosts and to make reinstalling testaroli cleaner)
  6. I shut down testaroli, backed up its disk images and deleted them
  7. I shut down pestaroli, didn’t back up its disk images and deleted them
  8. I recreated the disk images as follows:
    fiori# ~/opt/virttools/bin/create-perfect-drbd-vol -v --run-remote testaroli 20 192.168.3.6 192.168.3.7
    fiori# ~/opt/virttools/bin/create-perfect-drbd-vol -v --run-remote testaroli_containers 80 192.168.3.6 192.168.3.7
    fiori# ~/opt/virttools/bin/create-perfect-drbd-vol -v --run-remote pestaroli 20 192.168.3.6 192.168.3.7
    fiori# ~/opt/virttools/bin/create-perfect-drbd-vol -v --run-remote pestaroli_containers 80 192.168.3.6 192.168.3.7

    (Ordinarily, I would not wait for the DRBD syncs to complete, but it was near the end of the day and four of them added load to the systems, so I let it run overnight. Also: I realise this means I will have four copies of the containers I put on the containers DRBD device; I don’t care, I want an accurate test environment.)

  9. I modified the two VM creator scripts to load msg.sh
  10. I created the pestaroli and testaroli VMs, initially with only one disk (as the script expects), by running:
    fiori# create-perfect-kvm-vm -v --run-remote pestaroli block:/dev/drbd_pestaroli torchio
    fiori# create-perfect-kvm-vm -v --run-remote testaroli block:/dev/drbd_testaroli torchio
  11. At this point I made a slight digression: I wanted pestaroli and testaroli to replicate over the replication network, not over the public network, but both fiori and torchio have eth1 configured with an IP, rather than being used as a bridge device, so I did the following on both fiori and torchio (though not in parallel):
    1. saved all file/web edits
    2. migrated all running VMs to the other host
    3. reconfigured eth1 as br1 using eth1
    4. rebooted
  12. I installed Debian 11 on pestaroli and testaroli, including tailoring and running PCMS.
  13. I installed packages: lxc, libfile-lchown-perl (needed by shift-uid-gid), libvirt-daemon-driver-lxc (systemctl restart libvirtd needed afterwards), xmlstarlet
  14. I shutdown both VMs and:
    1. I added the /dev/drbd_<vm>_containers storage to pestaroli and testaroli.
    2. I added a second interface on the replication network and configured it (actually, I had to do this before running PCMS as it expected the 2nd NICs to exist)
    3. I added corresponding entries to DNS
    4. I increased memory to 4GB
    5. I restarted the VMs
    6. I set up the new disk device for use by DRBD:
      apt-get -y install drbd-utils
      pvcreate /dev/vdb
      vgcreate vg1 /dev/vdb
  15. I added a new mail server name (mafalde) to DNS
  16. I set up ssh access between the nodes
  17. I created a 20GB DRBD device out of that storage by running:
    create-perfect-drbd-vol -v --run-remote mafalde 20 192.168.3.31 192.168.3.32
  18. In CheckMK, I refreshed the list of services to monitor on pestaroli and testaroli.
  19. Regarding where container-specific devices will be mounted (using the hook):
    1. I could find no suggestions on the web, so decided upon /var/lib/libvirt/mountpoints
    2. On both pestaroli and testaroli, I checked out the module containing the hook and hooked it in as follows:
      mkdir -p ~/opt
      svn co https://svn.pasta.freemyip.com/main/virttools/trunk ~/opt/virttools 
      mkdir -p /etc/libvirt/hooks/lxc.d
      ln ~/opt/virttools/bin/etc-libvirt-hooks-lxc.d-mounter /etc/libvirt/hooks/lxc.d/mounter

      (As above, the link must be a hardlink, not a symlink.)

  20. Create the container by running:
    drbdadm primary drbd_mafalde
    create-basic-lxc-vm -v --remote=testaroli mafalde /dev/drbd_mafalde 100000
  21. Start the container with:
    virsh --connect=lxc:/// start mafalde
  22. I modified vm-list and vm-migrate to support LXC, msg.sh, pestaroli/testaroli.
  23. I used vm-migrate to migrate mafalde backwards and forwards.
  24. Prior to setting up PCMS I installed a few packages that were missing and that the PCMS procedure expected to already be there. The installation of these packages could become part of the installing-a-container procedure or of the installing-pcms procedure:
    1. vim: needed to edit /etc/ssh/sshd_config; PCMS page also says to install it
    2. wget: needed to download svn-cache-passwd
    3. apt-utils: pcms calls apt-get, which constantly complains:
      debconf: delaying package configuration, since apt-utils is not installed
      Selecting previously unselected package apt-utils.
    4. xterm: useful as it contains resize, which is useful for the serial terminal (LXC console)
  25. I PCMS-ed the container, first as just a dumb server, according to Installing and running PCMS.
  26. Disable various mounts that won’t work in containers:
    systemctl mask sys-kernel-debug.mount
    systemctl mask sys-kernel-tracing.mount
    systemctl mask sys-kernel-config.mount
    systemctl mask systemd-journald-audit.socket
  27. I saw there was no syslog process inside the container, so installed rsyslogd.
  28. I submitted bug report BTS#991773, regarding attempts to reboot a container shutting it down instead.

Misaligned package lists

  1. There are a lot of differences between the list of packages on pestaroli and on mafalde. Why? Here’s the list:
    pestaroli# comm -1 -3 <(dpkg -l | sed -nr -e '1,5d' -e 's/^ii  ([^ ]+).*/\1/p' | sort) <(chroot /var/lib/libvirt/mountpoints/mafalde/root dpkg -l | sed -nr -e '1,5d' -e 's/^ii  ([^ ]+).*/\1/p' | sort) | egrep -v '^(lib|lxc|grub)'
    dialog
    fontconfig
    fonts-droid-fallback
    fonts-noto-mono
    fonts-urw-base35
    ghostscript
    groff
    gsfonts
    hicolor-icon-theme
    imagemagick
    imagemagick-6-common
    imagemagick-6.q16
    mesa-vulkan-drivers:amd64
    netpbm
    poppler-data
    psutils
    systemd-timesyncd
    x11-utils
    pestaroli# comm -2 -3 <(dpkg -l | sed -nr -e '1,5d' -e 's/^ii  ([^ ]+).*/\1/p' | sort) <(chroot /var/lib/libvirt/mountpoints/mafalde/root dpkg -l | sed -nr -e '1,5d' -e 's/^ii  ([^ ]+).*/\1/p' | sort) | egrep -v '^(lib|lxc|grub)'
    apt-listchanges
    arch-test
    augeas-lenses
    bash-completion
    bind9-dnsutils
    busybox-static
    cloud-image-utils
    console-setup
    console-setup-linux
    cron
    debconf-i18n
    debian-faq
    debootstrap
    discover
    discover-data
    distro-info
    dnsmasq
    doc-debian
    drbd-utils
    efibootmgr
    fakechroot
    fakeroot
    fdisk
    firmware-linux-free
    genisoimage
    gpm
    iamerican
    ibritish
    ienglish-common
    initramfs-tools
    initramfs-tools-core
    installation-report
    ipxe-qemu
    isc-dhcp-common
    iso-codes
    ispell
    kbd
    keyboard-configuration
    keyutils
    klibc-utils
    krb5-locales
    laptop-detect
    linux-base
    linux-image-5.10.0-7-amd64
    linux-image-5.10.0-8-amd64
    linux-image-amd64
    mailcap
    manpages
    mime-support
    mmdebstrap
    mokutil
    nano
    ncurses-term
    netcat-openbsd
    netcat-traditional
    nfs-common
    ntp
    ntpdate
    os-prober
    ovmf
    python3-debconf
    qemu-guest-agent
    qemu-system-common
    qemu-system-data
    qemu-system-x86
    qemu-utils
    rpcbind
    seabios
    shim-helpers-amd64-signed
    shim-signed-common
    shim-signed:amd64
    shim-unsigned
    systemd-container
    task-english
    tasksel
    tasksel-data
    uidmap
    util-linux-locales
    vim-tiny
    wamerican
    whiptail
    xauth
    xkb-data
    xmlstarlet
    pestaroli#
  2. Actually, it is not so bad: if I go back to pre-PCMS-ed systems, with the KVM system not installing any tasksel groups, then:
    1. I installed openssh-server on the KVM system
    2. I installed vim on the LXC system
    3. I installed debfoster on both
    4. I ran debfoster, answered ‘yes’ to all packages, and then diffed as follows (192.168.1.156 is KVM and 192.168.1.157 is LXC):
      lagane$ #  only on KVM
      lagane$ comm -2 -3 <(ssh root@192.168.1.156 sort /var/lib/debfoster/keepers) <(ssh root@192.168.1.157 sort /var/lib/debfoster/keepers)
      X11 forwarding request failed on channel 0
      console-setup
      discover
      eject
      grub-efi-amd64
      installation-report
      libreadline8
      linux-image-5.10.0-7-amd64
      linux-image-amd64
      lvm2
      qemu-guest-agent
      task-english
      usbutils
      whiptail
      lagane$ # only on LXC
      lagane$ comm -1 -3 <(ssh root@192.168.1.156 sort /var/lib/debfoster/keepers) <(ssh root@192.168.1.157 sort /var/lib/debfoster/keepers)
      X11 forwarding request failed on channel 0
      dialog
      locales
      lagane$
    5. Things to explain are:
      1. why are libreadline8, task-english, qemu-guest-agent, usbutils and whiptail on the KVM system?
      2. why are dialog and locales on the LXC system? I think lxc-create might install these.
      3. the rest are hardware or kernel related and therefore not appropriate to a container.
    6. Explanations/response:
      1. locales is on both systems, but it gets pulled in as a dependency on the KVM one and not on the LXC one (I think the lxc-create command installs it explicitly)
      2. whiptail and dialog are functionally the same; presumably not all packages are aligned in what they prerequire
      3. KVM: remove libreadline8 and task-english
      4. LXC: install whiptail; remove dialog
      5. Both XXX and YYY have been updated with these instructions.
  3. I did another loop of package comparison:
    1. I did the container and the KVM VM installations following the instructions at Installing Debian 11 on a PM or KVM VM and Installing Debian 11 on an LXC container.
    2. I configured sshd_config to allow root login with password
    3. I used ip -o -4 to get the IP addresses
    4. I used ssh-copy-id to copy alexis@lagane’s key to root on each host
    5. On both systems I installed debfoster (that this is missing is not a misalignment because it is missing on both, but I need it for doing the checks)
    6. I ran debfoster -q on both hosts
    7. I compared the package lists from lagane:
      lagane$ comm -2 -3 <(ssh root@192.168.1.159 sort /var/lib/debfoster/keepers) <(ssh root@192.168.1.158 sort /var/lib/debfoster/keepers)
      console-setup
      discover
      eject
      grub-efi-amd64
      installation-report
      linux-image-5.10.0-7-amd64
      linux-image-amd64
      lvm2
      qemu-guest-agent
      usbutils
      util-linux-locales
      lagane$ comm -1 -3 <(ssh root@192.168.1.159 sort /var/lib/debfoster/keepers) <(ssh root@192.168.1.158 sort /var/lib/debfoster/keepers)
      net-tools
      whiptail
      lagane$

      Note that, in this particular loop, the KVM VM got the higher IP by DHCP when before it had got the lower.

  4. I considered integrating these next steps into PCMS:
    1. Disable various mounts that won’t work in containers:
      systemctl mask sys-kernel-debug.mount
      systemctl mask sys-kernel-tracing.mount
      systemctl mask sys-kernel-config.mount
      systemctl mask systemd-journald-audit.socket
    2. Align the set of installed packages with a common PM/KVM-VM/LXC baseline (see here for more details) by running:
      apt-get --no-install-recommends install whiptail vim rsyslog
      apt-get --purge autoremove dialog

    but in the end decided not to.

Epilogue

See also