Configuring storage and virtualisation services (generation 3.1)

Introduction

This page describes how Alexis Huxley installed and configured a replicated storage and virtualisation environment using two nodes, DRBD, OCFS2 and KVM.

Note that:

  • all procedures are to be run on both nodes, regardless of whether they are both being configured at the same time or not, unless explicitly stated otherwise
  • basic OS installation, including LVM setup, is not covered here
  • split-brain is a real possibility, but:
    • in a single-user environment it is easy to merge manually (see the recovery sketch after this list)
    • services provided by VMs continue to operate normally
    • no immediate action is required
    • in six years my setup has split-brained only a couple of times, and never since the introduction of a dedicated cluster network
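As a rough illustration of what merging manually involves: the usual DRBD split-brain recovery is to pick one node as the ‘victim’ and discard its changes. A minimal sketch, assuming the drbd_small resource defined later on this page and the DRBD 8.3-style command syntax used elsewhere here:

    #  On the victim node (unmount its OCFS2 filesystem first so it can be demoted):
    drbdadm secondary drbd_small
    drbdadm -- --discard-my-data connect drbd_small

    #  On the surviving node (only needed if it reports StandAlone):
    drbdadm connect drbd_small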

Local storage volumes

The main part of this procedure deals with replicated storage, but some local storage is always useful (e.g. for VM experiments, scratch space, …).

  1. Create LVs:
    lvcreate --name=local --size=200g vg0
    
  2. Format as XFS, which supports online growth (see the growth example after this list):
    mkfs -t xfs -f /dev/vg0/local
  3. Add fstab entries for them all as below, create mountpoints and mount them:
    /dev/mapper/vg0-local /vol/local xfs auto,noatime,nodiratime 0 2

    (Note that I do the fstab entry using PCMS, because otherwise the change is reverted.)
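Should the local volume need more space later, XFS's online growth means it can be extended while mounted. A minimal sketch (the extra 50g is purely illustrative):

    #  Grow the LV, then grow the mounted filesystem to fill it
    lvextend --size=+50g /dev/vg0/local
    xfs_growfs /vol/local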

Dedicated network interface

You should probably use a dedicated network card for cluster communications in order to ensure that public traffic does not impact replication.

I use traditional NIC naming (i.e. eth0, eth1), which is not persistent. This is a problem because I have three NICs in each machine, and the names eth1 and eth2 are assigned effectively at random to the second and third NICs at each reboot, so persistent naming is required.

  1. Edit /etc/udev/rules.d/70-persistent-net.rules to contain something like:
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0e:0c:c5:f0:6d", \
        ATTR{dev_id}=="0x0", ATTR{type}=="1", NAME="eth1"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="10:fe:ed:05:92:6d", \
        ATTR{dev_id}=="0x0", ATTR{type}=="1", NAME="eth2"

    Note that I don’t bother with an entry for eth0, because that is always eth0 anyway, probably because it is on the system board.

  2. Reboot a few times to verify that the NICs are named consistently.
  3. Add a suitable entry to /etc/network/interfaces for the NIC you will use for cluster communications, and add an entry for it to /etc/hosts (see the sketch below).
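For example, a minimal static stanza for the cluster NIC might look like the following; the addresses match the cluster network used in the DRBD configuration below, but the ‘-backlan’ hostnames are only an assumed naming convention:

    #  /etc/network/interfaces (on fiori; torchio would use 192.168.3.7)
    auto eth2
    iface eth2 inet static
        address 192.168.3.6
        netmask 255.255.255.0

    #  /etc/hosts
    192.168.3.6    fiori-backlan
    192.168.3.7    torchio-backlan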

Replicated storage volumes

The procedure differs slightly according to whether or not both nodes (i.e. the two nodes providing the backing storage for the replicated volumes) are available at the same time; both cases are covered below.

  1. Create LVs:
    lvcreate --name=small    --size=200g vg0
    lvcreate --name=vmpool0  --size=500g vg0
    lvcreate --name=pub      --size=2t   vg0
  2. For the first volume, add a dual-primary DRBD configuration in /etc/drbd.d/drbd-small.res, which references IPs on the dedicated cluster network:
    resource drbd_small {
      protocol  C;
      device    /dev/drbd_small minor 0;
      meta-disk internal;
      disk      /dev/vg0/small;
      net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
      }
      #  Cannot be the backlan name, but should be the backlan IP.
      on fiori {
        address 192.168.3.6:7790;
      }
      #  Cannot be the backlan name, but should be the backlan IP.
      on torchio {
        address 192.168.3.7:7790;
      }
      handlers {
        before-resync-target "/sbin/drbdsetup $DRBD_MINOR secondary";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
      }
    }
    
  3. Repeat for the other volumes, taking care to set the right resource name and device name, and to increment the device minor number and TCP port number for each one (see the example resource file after this list).
  4. If both nodes are available at the same time (which avoids long DRBD synchronisations), then:
    1. On both nodes run:
      dd if=/dev/zero of=/dev/drbd_small bs=64M
      drbdadm -- --force create-md drbd_small
      drbdadm up drbd_small
    2. On one node only run:
      drbdadm -- --clear-bitmap new-current-uuid drbd_small
    3. On both nodes run:
      drbdadm primary drbd_small

    otherwise, if both nodes are not available at the same time and this is the first node, then:

    1. Run:
      drbdadm create-md drbd_small
      drbdadm up drbd_small
      drbdadm -- --overwrite-data-of-peer primary drbd_small
    2. Verify that the device is now primary:
      fiori# cat /proc/drbd
        0:drbd0  WFConnection Primary/Unknown UpToDate/DUnknown C r----s
      fiori#

    otherwise, if both nodes are not available at the same time and this is the second node, then:

    1. On the second node run:
      drbdadm create-md drbd_small
      drbdadm up drbd_small
    2. Wait for synchronisation to complete; you can monitor this and get an estimate of when it will complete by running:
      cat /proc/drbd
    3. Run:
      drbdadm primary drbd_small
  5. Repeat the last step for the other volumes.
  6. Create /etc/ocfs2/cluster.conf, which references IPs on the dedicated cluster network, containing:
    node:
            ip_port = 7777
            ip_address = 192.168.3.6
            number = 1
            name = fiori
            cluster = ocfs2
    
    node:
            ip_port = 7777
            ip_address = 192.168.3.7
            number = 2
            name = torchio
            cluster = ocfs2
    
    cluster:
            node_count = 2
            name = ocfs2
  7. Run:
    dpkg-reconfigure ocfs2-tools

    and when prompted

    Would you like to start an OCFS2 cluster (O2CB) at boot time?

    select Yes and for all following questions accept the default.

  8. Run:
    service o2cb start
    service o2cb online
  9. On one node only run:
    mkfs.ocfs2 --fs-feature-level=max-features -T mail /dev/drbd_small
  10. Add an fstab entry as below, create a mountpoint and mount it:
    /dev/drbd_small /vol/small ocfs2  _netdev,noatime,data=writeback,commit=60,nodiratime 0 1
    
  11. If you have existing data to migrate, then migrate them now.
  12. Repeat the previous steps for the other volumes, adjusting the argument to mkfs.ocfs2’s -T option according to the intended usage of each volume (see mkfs.ocfs2(8) for the possibilities).
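As an illustration of the increments mentioned in step 3, a resource file for the second volume (e.g. /etc/drbd.d/drbd-vmpool0.res; the minor number and port shown are simply the next values after drbd_small’s) might look like:

    resource drbd_vmpool0 {
      protocol  C;
      device    /dev/drbd_vmpool0 minor 1;
      meta-disk internal;
      disk      /dev/vg0/vmpool0;
      net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
      }
      #  Cannot be the backlan name, but should be the backlan IP.
      on fiori {
        address 192.168.3.6:7791;
      }
      #  Cannot be the backlan name, but should be the backlan IP.
      on torchio {
        address 192.168.3.7:7791;
      }
      handlers {
        before-resync-target "/sbin/drbdsetup $DRBD_MINOR secondary";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
      }
    }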

NFS shares

In order to ensure that VMs mount volumes from the server on which they are running, rather than from the ‘remote’ node, we define a private network within each server, using the same IP address for the server itself on both servers.

  1. Decide the private network to use and the address the servers will have on that network (remember both servers will use the same IP address!). For the sake of this procedure, let’s assume that the network is 192.168.10.0/24 and the servers will be 192.168.10.1.
  2. Note that, later, when creating VMs:
    1. they must address the NFS server as 192.168.10.1 and not use the NFS server’s public IP address
    2. if the address of the NFS server comes from an automounter NIS map and that NIS map is also used by physical machines, then the VMs will need entries in /etc/hosts to override the IP address that the NFS server’s name resolves to (see the sketch after this list)
  3. Write a suitable /etc/exports file. As an example here is my own:
    #  login servers get write access to home and pub
    /vol/small/home 192.168.1.0/24(rw,sync,no_root_squash,no_subtree_check) \
                    192.168.10.9(rw,sync,no_root_squash,no_subtree_check)
    /vol/pub        192.168.1.0/24(rw,sync,no_root_squash,no_subtree_check) \
                    192.168.10.9(rw,sync,no_root_squash,no_subtree_check)
    
    #  web servers get read access to home and pub and write access
    #  to svn
    /vol/small/home 192.168.10.8(ro,no_root_squash,no_subtree_check)
    /vol/pub        192.168.10.8(ro,no_root_squash,no_subtree_check)
    /vol/small/svn  192.168.10.8(rw,sync,no_root_squash,no_subtree_check)
    
    #  mail server gets write access to mail
    /vol/small/mail 192.168.10.29(rw,sync,no_root_squash,no_subtree_check)
    
    #  storage gateway gets write access to home and pub (to allow
    #  remote users to do stuff, as well as for backups) and read access to
    #  svn and mail for backups
    /vol/small/home 192.168.10.28(rw,sync,no_root_squash,no_subtree_check)
    /vol/pub        192.168.10.28(rw,sync,no_root_squash,no_subtree_check)
    /vol/small      192.168.10.28(ro,no_root_squash,no_subtree_check)
    
  4. Run:
    exportfs -av
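As a rough sketch of the /etc/hosts override mentioned in step 2: on each NFS-client VM the NFS server’s name is pinned to the private address. The name ‘fiori’ is used here only because it appears in the cluster configuration above; your automounter map may refer to the server by a different name:

    #  /etc/hosts on an NFS-client VM: force the NFS server's name to
    #  resolve to the server-local private address
    192.168.10.1    fiori

    #  Quick check from the VM that the exports are visible
    showmount -e 192.168.10.1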

Virtualisation

  1. Run:
    apt-get install qemu-kvm libvirt-bin qemu-utils
  2. Define the storage pools using virsh; e.g.:
    virsh pool-define-as --name=vmpool0   --type=dir \
        --target=/vol/vmpool0
    virsh pool-start vmpool0
    virsh pool-define-as --name=isoimages --type=dir \
        --target=/vol/pub/computing/software/isoimages/os
    virsh pool-start isoimages
    virsh pool-destroy default      #  unfortunately not persistent

    (Since vmpool0 and isoimages are on the replicated DRBD/OCFS2 storage, we choose at this time not to enable autostart on them.)

  3. Define the networks using virsh; e.g.:
    virsh net-destroy default
    #  storage network for VMs (see above)
    virsh net-define <(cat <<EOF
    <network>
    <name>192.168.10.0</name>
      <uuid>$(uuidgen)</uuid>
      <bridge name='virbr0' stp='on' delay='0'/>
      <mac address='52:54:00:81:cd:08'/>
      <ip address='192.168.10.1' netmask='255.255.255.0'>
      </ip>
    </network>
    EOF
    )
    #  No net definition required to plumb VMs into br0
  4. Set up SSH keys to allow virt-manager to be run from a remote system (see the sketch after this list).
  5. If you have existing VM images and definitions to migrate, then migrate them now.
  6. When creating VMs that will be NFS clients, remember the notes in the ‘NFS shares’ section above!
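A minimal sketch of the SSH key setup in step 4, run on the remote workstation (assuming root access to the virtualisation host and the hostname fiori):

    #  On the workstation that will run virt-manager:
    ssh-keygen -t rsa                        #  only if no key exists yet
    ssh-copy-id root@fiori

    #  Then connect to the host over SSH:
    virt-manager -c qemu+ssh://root@fiori/system
    virsh -c qemu+ssh://root@fiori/system list --all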

See also