Introduction
This page describes how Alexis Huxley installed and configured a replicated storage and virtualisation environment providing VM services and NFS services. The setup uses two nodes, DRBD, OCFS2, KVM and the kernel NFS server.
Since this is a two-node setup, DRBD split-brain is a real possibility, but if it happens then:
- because the only files on the replicated storage are VM images, it is clear which node has the correct version of each of the (relatively few) files, so manual merging is relatively simple
- no immediate action is required
- the effort required to correct the error does not grow the longer it is left unaddressed
- services provided by the VMs continue to operate normally
Of course we should minimise the chances of split-brain happening in the first place. I do this by:
- using a dedicated network connection for cluster communications
In practice, a bigger problem used to be freezes caused by the interaction between OCFS2 and NFSd. This procedure was modified to insert a PM/VM boundary between OCFS2 and NFSd, which solved the problem. For completeness, here is a kernel stack dump from one of those freezes; it might lead googlers to my solution:
[352691.710772] ------------[ cut here ]------------
[352691.712680] kernel BUG at /build/linux-usfZoe/linux-4.4.0/fs/ocfs2/inode.c:1343!
[352691.714589] invalid opcode: 0000 [#1] SMP
[352691.716487] Modules linked in: vhost_net vhost macvtap macvlan ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6_tables iptable_filter ip_tables x_tables xfs bridge stp llc snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore kvm_amd k10temp serio_raw 8250_fintek kvm i2c_piix4 tpm_infineon shpchp irqbypass mac_hid nfsd drbd auth_rpcgss nfs_acl lockd lru_cache grace libcrc32c sunrpc autofs4 btrfs xor raid6_pq uas usb_storage amdkfd amd_iommu_v2 crct10dif_pclmul crc32_pclmul radeon ghash_clmulni_intel aesni_intel i2c_algo_bit aes_x86_64 ttm lrw gf128mul glue_helper drm_kms_helper ablk_helper cryptd
[352691.734756]  syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse e1000 r8169 ahci drm mii libahci fjes
[352691.738868] CPU: 2 PID: 3533 Comm: nfsd Not tainted 4.4.0-53-generic #74-Ubuntu
[352691.740953] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88X-D3H, BIOS F1 08/19/2013
[352691.745100] task: ffff8807ff415280 ti: ffff88008d39c000 task.ti: ffff88008d39c000
[352691.747236] RIP: 0010:[<ffffffffc0954566>]  [<ffffffffc0954566>] ocfs2_validate_inode_block+0x116/0x1e0 [ocfs2]
[352691.751501] RSP: 0018:ffff88008d39f928  EFLAGS: 00010246
[352691.753619] RAX: 000000000020002c RBX: 0000000000000000 RCX: ffff88024e110548
[352691.755736] RDX: 0000000000000000 RSI: ffff88024e110548 RDI: ffff880807c06800
[352691.757826] RBP: ffff88008d39f968 R08: ffff88008d39fa20 R09: 0000000000000000
[352691.759923] R10: ffff88033956b510 R11: 0000000000000000 R12: ffff88024e110548
[352691.762009] R13: ffff88013366c000 R14: ffff88008d39fa20 R15: ffff880807c06800
[352691.764065] FS:  00007f4d8a2e8c00(0000) GS:ffff88083ed00000(0000) knlGS:0000000000000000
[352691.768056] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[352691.770066] CR2: 0000000000349000 CR3: 00000005bdcfe000 CR4: 00000000000406e0
[352691.772083] Stack:
[352691.774057]  0000000029b861f3 ffff88024e110548 0000000029b861f3 0000000000000000
[352691.776100]  0000000000000000 ffff88024e110548 ffff88008d39fa20 0000000000000000
[352691.778103]  ffff88008d39fa10 ffffffffc09399d6 0000000000000246 ffff880807c06800
[352691.780096] Call Trace:
[352691.782058]  [<ffffffffc09399d6>] ocfs2_read_blocks+0x456/0x660 [ocfs2]
[352691.784052]  [<ffffffffc0954450>] ? ocfs2_remove_inode+0x400/0x400 [ocfs2]
[352691.786013]  [<ffffffffc09569ea>] ocfs2_read_inode_block_full+0x4a/0x80 [ocfs2]
[352691.787956]  [<ffffffffc0956edc>] ocfs2_iget+0x4bc/0x6c0 [ocfs2]
[352691.789855]  [<ffffffffc094b9a3>] ocfs2_get_dentry+0x293/0x440 [ocfs2]
[352691.791702]  [<ffffffffc0598e70>] ? nfserrno+0x60/0x60 [nfsd]
[352691.793530]  [<ffffffffc094bbf5>] ocfs2_fh_to_dentry+0x45/0x60 [ocfs2]
[352691.795393]  [<ffffffff813152d2>] exportfs_decode_fh+0x72/0x2e0
[352691.797147]  [<ffffffffc059e3a9>] ? exp_find_key+0x89/0xd0 [nfsd]
[352691.798855]  [<ffffffff811ec072>] ? kmem_cache_alloc_trace+0x1d2/0x1f0
[352691.800545]  [<ffffffff8138ebaf>] ? apparmor_cred_prepare+0x2f/0x50
[352691.802194]  [<ffffffff81346cc3>] ? security_prepare_creds+0x43/0x60
[352691.803822]  [<ffffffffc0599b5c>] fh_verify+0x34c/0x650 [nfsd]
[352691.805448]  [<ffffffffc04f1fe8>] ? sunrpc_cache_lookup+0x78/0x350 [sunrpc]
[352691.807033]  [<ffffffffc059aa20>] nfsd_open+0x40/0x1e0 [nfsd]
[352691.808589]  [<ffffffffc04f1679>] ? cache_check+0x69/0x340 [sunrpc]
[352691.810126]  [<ffffffffc059b2c7>] nfsd_read+0x47/0x100 [nfsd]
[352691.811634]  [<ffffffff810a4791>] ? groups_free+0x51/0x60
[352691.813120]  [<ffffffffc05a415c>] nfsd3_proc_read+0xbc/0x150 [nfsd]
[352691.814615]  [<ffffffffc0595e78>] nfsd_dispatch+0xb8/0x200 [nfsd]
[352691.816107]  [<ffffffffc04e5eda>] svc_process_common+0x42a/0x690 [sunrpc]
[352691.817623]  [<ffffffffc04e72c3>] svc_process+0x103/0x1c0 [sunrpc]
[352691.819121]  [<ffffffffc05958cf>] nfsd+0xef/0x160 [nfsd]
[352691.820589]  [<ffffffffc05957e0>] ? nfsd_destroy+0x60/0x60 [nfsd]
[352691.822024]  [<ffffffff810a09d8>] kthread+0xd8/0xf0
[352691.823428]  [<ffffffff810a0900>] ? kthread_create_on_node+0x1e0/0x1e0
[352691.824839]  [<ffffffff8183640f>] ret_from_fork+0x3f/0x70
[352691.826222]  [<ffffffff810a0900>] ? kthread_create_on_node+0x1e0/0x1e0
[352691.827587] Code: 08 00 48 85 db 74 18 48 8b 03 48 8b 7b 08 48 83 c3 18 4c 89 f6 ff d0 48 8b 03 48 85 c0 75 eb 49 8b 04 24 a8 01 0f 85 2a ff ff ff <0f> 0b 49 89 d0 48 c7 c6 e0 0b 9b c0 48 c7 c2 80 7a 9b c0 4c 89
[352691.831988] RIP  [<ffffffffc0954566>] ocfs2_validate_inode_block+0x116/0x1e0 [ocfs2]
[352691.834906]  RSP <ffff88008d39f928>
[352691.841295] ---[ end trace fce0701850d72ef8 ]---
Note that:
- At least two physical NICs are required in each host!
Local storage
Virtualisation servers will use replicated storage for most VMs. However, occasionally, local space is useful (e.g. for a test VM).
- Create LVs:
lvcreate --name=local --size=200g vg0
- Format for XFS, which supports growing (though not shrinking) the filesystem online; see the sketch after this list:
mkfs -t xfs -f /dev/vg0/local
- Add an fstab entry as below, create the mountpoint and mount it:
/dev/mapper/vg0-local /vol/local xfs auto,noatime,nodiratime 0 2
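Growing the volume later, while it remains mounted, is then a two-step operation. A minimal sketch, assuming vg0 still has free extents and using an arbitrary 50g increase:

lvextend --size=+50g /dev/vg0/local
xfs_growfs /vol/local    # xfs_growfs operates on the mountpoint, not the device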
Persistent network device names
I use traditional NIC naming (i.e. eth0, eth1). By default this naming is not persistent (i.e. interface names may be rearranged after a reboot), but persistent naming is required here. This procedure is to be run on both nodes.
- Edit /etc/udev/rules.d/70-persistent-net.rules to contain something like:
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="74:d4:35:54:87:0e", \ ATTR{dev_id}=="0x0", ATTR{type}=="1", NAME="eth0" SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0e:0c:c5:f0:6d", \ ATTR{dev_id}=="0x0", ATTR{type}=="1", NAME="eth1" SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="10:fe:ed:05:92:6d", \ ATTR{dev_id}=="0x0", ATTR{type}=="1", NAME="eth2"
- Reboot.
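The MAC addresses used in rules like those above can be read straight out of sysfs; e.g.:

for NIC in /sys/class/net/eth*; do
    echo "$NIC $(cat $NIC/address)"    # print each interface's sysfs path and MAC
done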
Shared public network interface
The VMs that run on each node will need access to the public network interface.
- Reconfigure the stanza for eth0 in /etc/network/interfaces accordingly. E.g.:
iface eth0 inet manual

auto br0
iface br0 inet static
    address 192.168.1.6
    netmask 255.255.255.0
    gateway 192.168.1.1
    bridge_ports eth0
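Once the interfaces are up, the bridge can be sanity-checked; e.g. (assuming bridge-utils is installed):

brctl show br0      # eth0 should be listed as an attached interface
ip addr show br0    # the static address should appear here, not on eth0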
Dedicated network interface for cluster communications
It is essential to use a dedicated network card for cluster communications in order to ensure that public traffic does not impact replication.
- Add a suitable entry to /etc/network/interfaces for the NIC you will use for cluster communications, and add an entry for it to /etc/hosts (an example hosts entry follows this list). E.g.:
auto eth1
iface eth1 inet static
    address 192.168.3.6
    netmask 255.255.255.0
- Reboot.
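The /etc/hosts entries mentioned above map names to the dedicated cluster IPs; the '-backlan' names here are just an illustrative convention, not something the procedure depends on:

192.168.3.6    fiori-backlan
192.168.3.7    torchio-backlan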
Replicated storage volumes
This procedure is to be run on both nodes, regardless of whether they are both being configured at the same time or not, unless explicitly stated otherwise.
- Create LVs:
lvcreate --name=vmpool0 --size=500g vg0    # I use this for disk images for VMs
lvcreate --name=vmpool1 --size=2t vg0      # I use this for a large disk image for one particular VM
- For the first volume, add a dual-primary DRBD configuration in /etc/drbd.d/drbd-vmpool0.res, which references IPs on the dedicated cluster network:
resource drbd_vmpool0 {
    protocol C;
    device /dev/drbd_vmpool0 minor 0;
    meta-disk internal;
    disk /dev/mapper/vg0-vmpool0;
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    # Cannot be the backlan name, but should be the backlan IP.
    on fiori {
        address 192.168.3.6:7790;
    }
    # Cannot be the backlan name, but should be the backlan IP.
    on torchio {
        address 192.168.3.7:7790;
    }
    handlers {
        before-resync-target "/sbin/drbdsetup $DRBD_MINOR secondary";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
}
- Repeat for the other volumes, taking care to set the right resource name, device name, incremented device minor number and incremented TCP port number; see the sketch below.
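As a sketch, /etc/drbd.d/drbd-vmpool1.res would then differ from the first resource only in those four respects:

resource drbd_vmpool1 {
    protocol C;
    device /dev/drbd_vmpool1 minor 1;      # minor incremented
    meta-disk internal;
    disk /dev/mapper/vg0-vmpool1;
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    on fiori   { address 192.168.3.6:7791; }    # port incremented
    on torchio { address 192.168.3.7:7791; }    # port incremented
    handlers {
        before-resync-target "/sbin/drbdsetup $DRBD_MINOR secondary";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
}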
- If both nodes are available at the same time (which avoids long DRBD synchronisations), then:
- On both nodes run:
dd if=/dev/zero of=/dev/mapper/vg0-vmpool0 bs=64M    # zero the backing LV so both nodes start identical
drbdadm -- --force create-md drbd_vmpool0
drbdadm up drbd_vmpool0
- On one node only run:
drbdadm -- --clear-bitmap new-current-uuid drbd_vmpool0
- On both nodes run:
drbdadm primary drbd_vmpool0
otherwise, if both nodes are not available at the same time and this is the first node, then:
- Run:
drbdadm create-md drbd_vmpool0
drbdadm up drbd_vmpool0
drbdadm -- --overwrite-data-of-peer primary drbd_vmpool0
- Verify that the device is now primary:
fiori# cat /proc/drbd
 0:drbd0 WFConnection Primary/Unknown UpToDate/DUnknown C r----s
fiori#
otherwise, if both nodes were not available at the same time and this is the second node, then:
- On the second node run:
drbdadm create-md drbd_vmpool0
drbdadm up drbd_vmpool0
- Wait for synchronisation to complete; you can monitor this and get an estimate of when it will complete by running:
cat /proc/drbd
- Run:
drbdadm primary drbd_vmpool0
- On both nodes, repeat the steps above for the other volumes.
- Create /etc/ocfs2/cluster.conf, which references IPs on the dedicated cluster network, containing:
node:
    ip_port = 7777
    ip_address = 192.168.3.6
    number = 1
    name = fiori
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.168.3.7
    number = 2
    name = torchio
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
- Run:
dpkg-reconfigure ocfs2-tools
and when prompted
Would you like to start an OCFS2 cluster (O2CB) at boot time?
select Yes and accept the defaults for all subsequent questions.
- Run:
service o2cb start
service o2cb online
- On one node only run:
mkfs.ocfs2 --fs-feature-level=max-features -T vmstore /dev/drbd_vmpool0
- Add an fstab entry as below, create a mountpoint and mount it:
/dev/drbd_vmpool0 /vol/vmpool0 ocfs2 _netdev,noatime,data=writeback,commit=60,nodiratime 0 1
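Creating the mountpoint and mounting it then looks like this, using the path from the fstab entry above:

mkdir -p /vol/vmpool0
mount /vol/vmpool0    # picks up the options from fstab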
Hypervisors
This procedure is to be run on both nodes, regardless of whether they are both being configured at the same time or not, unless explicitly stated otherwise.
- Run:
apt-get install qemu-kvm libvirt-bin qemu-utils virt-top
- Define the storage pools using virsh; e.g.:
virsh pool-define-as --name=vmpool0 --type=dir \
    --target=/vol/vmpool0
virsh pool-start vmpool0
virsh pool-define-as --name=isoimages --type=dir \
    --target=/vol/pub/computing/software/isoimages/os
virsh pool-start isoimages
virsh pool-destroy default    # unfortunately not persistent
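Note that pools defined this way are not started at boot by default; if that is wanted, marking them for autostart is one line each:

virsh pool-autostart vmpool0
virsh pool-autostart isoimages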
- Define a network to allow co-hosted VMs to communicate directly using virsh; e.g.:
virsh net-destroy default
# storage network for VMs (see above)
virsh net-define <(cat <<EOF
<network>
  <name>192.168.10.0</name>
  <uuid>$(uuidgen)</uuid>
  <bridge name='virbr0' stp='on' delay='0'/>
  <mac address='52:54:00:81:cd:08'/>
  <ip address='192.168.10.1' netmask='255.255.255.0'>
  </ip>
</network>
EOF
)
# No net definition required to plumb VMs into br0
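The newly defined network is not yet running; starting it and marking it to start at boot looks like this:

virsh net-start 192.168.10.0
virsh net-autostart 192.168.10.0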
- Set up SSH keys to allow the running of virt-manager from a remote system.
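A minimal sketch of that key setup, assuming the connection is made as root and the remote system's hostname ('ravioli') is purely illustrative:

# on the remote system:
ssh-keygen -t rsa                              # if no key exists yet
ssh-copy-id root@fiori
ssh-copy-id root@torchio
virt-manager -c qemu+ssh://root@fiori/system   # connect to one hypervisor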
- If you have existing VM images and definitions to migrate, then migrate them now.
NFS services from a VM
This procedure is to be run on a single VM, not on the virtualisation servers!
In the list of LVs above, a 2TB volume is created on the virtualisation servers. This is to be given to a VM that will act as an NFS server.
- Make a filesystem on the large virtual disk, mount it and create subdirectories, or use LVM to divide the storage up as you want; see the sketch below.
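A minimal sketch of the simple (non-LVM) variant, assuming the 2TB virtual disk appears inside the VM as /dev/vdb (check with lsblk first; the device name is an assumption):

mkfs -t xfs -f /dev/vdb
mkdir -p /vol/small
echo '/dev/vdb /vol/small xfs auto,noatime,nodiratime 0 2' >> /etc/fstab
mount /vol/small
mkdir -p /vol/small/home /vol/small/svn /vol/small/mail    # subdirectories matching the exports below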
- Due to Ubuntu bug 1558196, run the following command:
systemctl add-wants multi-user.target rpcbind.service
(See https://askubuntu.com/questions/771319/in-ubuntu-16-04-not-start-rpcbind-on-boot for more details.)
- If an NFS client is a VM and it is running on the same physical host as the NFS server, then some performance can be gained by directing the NFS client at the NIC on the NFS server that is on the shared virtual network. Therefore:
- Ensure the VM has a second interface connected to the virtual network that was created in the virtualisation servers earlier. For the sake of this procedure, let's assume that the network is 192.168.10.0/24 and that the server's address on it will be 192.168.10.28.
- Note that, later, when creating other VMs:
- they will also need a second interface connected to the virtual network that was created in the virtualisation servers earlier.
- they should attempt to mount the NFS share first using the NFS server's second interface and then fall back to the NFS server's first interface, as in this example automounter entry:
pub -nordirplus,noatime,nodiratime,nfsvers=3,proto=tcp filer.pasta.net,fettuce.pasta.net:/vol/pub
- Write a suitable /etc/exports file. As an example here is my own:
/vol/small/home  192.168.1.0/24(rw,sync,no_root_squash,no_subtree_check) 192.168.10.9(rw,sync,no_root_squash,no_subtree_check)
/vol/pub         192.168.1.0/24(rw,sync,no_root_squash,no_subtree_check) 192.168.10.9(rw,sync,no_root_squash,no_subtree_check)
/vol/small/home  192.168.10.8(ro,no_root_squash,no_subtree_check)
/vol/pub         192.168.10.8(ro,no_root_squash,no_subtree_check)
/vol/small/svn   192.168.10.8(rw,sync,no_root_squash,no_subtree_check)
/vol/small/mail  192.168.10.29(rw,sync,no_root_squash,no_subtree_check)
- Run:
exportfs -av
- For reasons I don't understand, when I try to 'svn commit', the NFS server logs:
lockd: cannot monitor <web-server-hostname>
The only fix I’ve been able to find for this is to include the following in the NFS client’s mount options (or in the auto.staging map):
...,nolock,...
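Applied to the example automounter entry shown earlier, that gives:

pub -nordirplus,noatime,nodiratime,nfsvers=3,proto=tcp,nolock filer.pasta.net,fettuce.pasta.net:/vol/pub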
SMB services from a VM
SMB is useful for allowing smartphones, Windows and Mac machines to transfer files (e.g. to put MP3s onto a smartphone).
- Run:
apt-get install samba
- Convert Unix accounts to SMB accounts as follows:
# pdbedit seems to have no way to pre-lock accounts so we'll use secure passwords
pwgen() { dd if=/dev/urandom bs=1 count=100 2>/dev/null | base64 -w0; }
# we'll need to extract login and fullname from entries in /etc/passwd or getent
fanoutpwent() { perl -pe 's/^([^:]*):([^:]*):([^:]*):([^:]*):([^,]*),([^,]*),([^,]*),([^,]*):([^:]*):([^:]*)\n/"$1" "$2" "$3" "$4" "$5" "$6" "$7" "$8" "$9" "$10"\n/g;' <<<"$1"; }
# generic function to run a shell command after getting ok to run it
shi() { while read -r X; do eval set -- "$X"; read -p "$1: " YESNO < /dev/tty; [ "X$YESNO" != Xy ] || eval "$2"; done; }
UID_MIN=$(sed -n 's/^UID_MIN[\t ]*//p' /etc/login.defs)
UID_MAX=$(sed -n 's/^UID_MAX[\t ]*//p' /etc/login.defs)
getent passwd | awk -F: "{ if ( \$3 >= $UID_MIN && \$3 <= $UID_MAX ) { print } }" | while read PWENT; do
    eval set -- $(fanoutpwent "$PWENT")
    P=$(pwgen)
    echo "$1 '{ echo \"$P\"; echo \"$P\"; } | pdbedit --create --user \"$1\" --fullname \"$5\" --password-from-stdin'"
done | shi
and follow the prompts regarding which accounts to create.
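The accounts that were created can be listed afterwards with:

pdbedit --list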
- Edit /etc/samba/smb.conf and set:
[global]
    ...
    # See http://www.spinics.net/lists/samba/msg69479.html
    strict locking = no
    # this doesn't work so don't bother uncommenting it
    #hide dot files = yes
    ...

[homes]
    ...
    read only = no
    # this doesn't work so don't bother uncommenting it
    #hide dot files = yes
    ...

[pub]
    comment = Public Archive
    browsable = yes
    path = /pub/

#[printers]
#...

#[print$]
#...
- Run:
service samba reload
- Try to connect from an SMB client using smbclient as follows:
- Edit /etc/samba/smb.conf on the client and change:
syslog = 0
to
logging = syslog@0
(Without this you will see the warning message ‘WARNING: The “syslog” option is deprecated’. Note also that there is no need to make this change on the SMB server.)
- Run:
smbclient '\\fettuce\pub'
and
smbclient '\\fettuce\alexis'