Configuring storage services generation three

Introduction

This page describes how Alexis Huxley installed and configured a replicated storage server.

The configuration described here was developed during preparation for my Debian Wheezy rollout. The disk layout is as described there; i.e.:
login-vm-server-disk-structure

Note:

  • two systems are configured in this manner, with replication of the indicated volumes via DRBD
  • each system contains only one 4TB disk – there is no RAID – if a disk dies then reinstallation will be required, but data will not be lost
  • data is shared out of subdirectories on relatively large volumes whose names (i.e. mountpoints) are derived from the underlying format of the volume (i.e. the device being mounted)
  • server-side mountpoints more appropriate for advertising are presented to clients by using symlinks to the actual server-side mountpoints
  • bidirectional replication of media files is done with unison, called from a cronjob
  • replicated data is shared to other machines via NFS; unreplicated data is not shared
  • replicated data is shared with the machine itself and the ‘other’ machine by using an OCFS2 filesystem on top of the replicated devices

Pdi (part of MDI) partitions the disk using a preseeded partman recipe, so no action is required; it also creates LVs for root and swap, leaving all remaining space available for storage volumes.

The commands and sizes below are derived from the diagram above.

Things to consider next time

This section is for the author to review before repeating this procedure.

  1. Create the libvirt-qemu user and group before installing any packages; this way I can force the same UID and GID across all OCFS2 cluster members, which will otherwise see the same files but disagree about who owns them, as illustrated here:
    rotini# ssh fiori ls -ld /vol/vmpool2/trennette.img
    -rw------- 1 vde2-net libvirt-qemu 16777216000 Jan  1 16:54 /vol/vmpool2/trennette.img
    rotini# ssh torchio ls -ld /vol/vmpool2/trennette.img
    -rw------- 1 libvirt-qemu libvirt-qemu 16777216000 Jan  1 16:54 /vol/vmpool2/trennette.img
    rotini#
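A way to catch this early (besides pre-creating the user and group with fixed IDs before package installation): compare the UID/GID fields of the passwd entries on both nodes. A minimal sketch, where in practice the passwd lines would come from `ssh <node> getent passwd libvirt-qemu`; the helper name and the sample UIDs below are hypothetical:

```shell
# Compare the UID:GID fields (fields 3 and 4) of two passwd entries
# fetched from two cluster nodes.
uid_gid_match() {
    [ "$(echo "$1" | cut -d: -f3,4)" = "$(echo "$2" | cut -d: -f3,4)" ]
}

# Hypothetical entries with mismatched UIDs, as in the listing above:
uid_gid_match 'libvirt-qemu:x:105:105::/var/lib/libvirt:/bin/false' \
              'libvirt-qemu:x:107:105::/var/lib/libvirt:/bin/false' \
    || echo 'UID/GID mismatch'    # prints: UID/GID mismatch
```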

Prologue

  1. Run:
    mkdir /vol
    apt-get install drbd8-utils nfs-kernel-server ocfs2-tools
    service drbd start
  2. If you are using secondary groups on NFS clients (i.e. users belong to more than just their login group), as in this example:
    chifferi$ id
    uid=1000(alexis) gid=1000(alexis) groups=1000(alexis),1001(multimedia-editors)
    chifferi$

    then you may encounter LP#409366 and you should run the following:

    perl -pi -e 's/^(RPCMOUNTDOPTS=).*/$1/' /etc/default/nfs-kernel-server
    service nfs-kernel-server restart
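The substitution can be tried safely on a scratch copy first; the sample content below mirrors the Debian default (RPCMOUNTDOPTS="--manage-gids"), which is what interacts badly with secondary groups:

```shell
# Demonstrate the RPCMOUNTDOPTS substitution on a scratch copy of the
# defaults file rather than on /etc/default/nfs-kernel-server itself.
TMP=$(mktemp)
printf 'RPCNFSDCOUNT=8\nRPCMOUNTDOPTS="--manage-gids"\n' > "$TMP"
perl -pi -e 's/^(RPCMOUNTDOPTS=).*/$1/' "$TMP"
grep '^RPCMOUNTD' "$TMP"    # prints: RPCMOUNTDOPTS=
rm -f "$TMP"
```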

De-chicken-and-egging

The VM server currently depends on the DNS server VM; this dependency needs to be broken.

  1. Configure the system as a caching nameserver with no forwarders.
  2. Modify /etc/resolv.conf to contain only:
    nameserver 127.0.0.1

    (Do not be tempted to add a domain or a list of search domains!)

  3. Empty /etc/hosts of all but localhost, the host itself (without domain name) and other cluster nodes (also without domain name).
  4. Verify that the following works:
    ping <other-cluster-node-by-name>
  5. Run:
    apt-get update

    If it fails with the error:

    Err http://security.debian.org wheezy/updates Release.gpg
      Could not resolve 'security.debian.org'

    then run:

    service nscd restart

    and try running apt-get again. (I encountered this on fideua and this procedure cleared it.)
    If it fails with an error like this:

    Err <url> wheezy/non-free i386 Packages
               406  Not Acceptable

    then run:

    cd /etc/apt/sources.list.d/
    mkdir /etc/apt/sources.list.d/KEEP
    mv /etc/apt/sources.list.d/*.list /etc/apt/sources.list.d/KEEP
    for X in /etc/apt/sources.list.d/KEEP/*; do
        apt-get update > /dev/null 2>&1
        mv "$X" /etc/apt/sources.list.d/
    done

    and try running apt-get again. (I encountered this on gemelli and this procedure cleared it.)

  6. Edit /etc/postfix/main.cf to specify the local mail relay by IP address, then run:
    postfix reload
    postqueue -f
  7. Edit /etc/apt/sources.list and /etc/apt/sources.list.d/*.list to specify the install server by IP address.

LVM

  1. Run:
    lvcreate --name=rep-no     --size=100G vg0
    lvcreate --name=rep-drbd0  --size=200G vg0
    lvcreate --name=rep-drbd1  --size=200G vg0
    lvcreate --name=rep-unison --size=2T   vg0
  2. At this stage the LVM configuration looks like this:
    fiori# lvs
      LV         VG   Attr     LSize   Pool Origin Data%  Move Log Copy%  Convert
      rep-drbd0  vg0  -wi-a--- 200.00g
      rep-drbd1  vg0  -wi-a--- 200.00g
      rep-no     vg0  -wi-a--- 100.00g
      rep-unison vg0  -wi-a---   2.00t
      root       vg0  -wi-ao--  14.43g
      swap       vg0  -wi-ao--   3.72g
    fiori# vgs
      VG   #PV #LV #SN Attr   VSize VFree
      vg0    1   6   0 wz--n- 3.64t 1.13t
    fiori#
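As a cross-check, the allocation implied by the lvcreate commands can be reconciled with the vgs output above (all figures approximate, in GiB):

```shell
# Storage LVs (100 + 200 + 200 + 2048 GiB) plus root (~15) and swap (~4)
# should leave about 1.13 TiB of the 3.64 TiB (~3725 GiB) VG free.
ALLOC=$((100 + 200 + 200 + 2048 + 15 + 4))
echo "allocated ~${ALLOC} GiB, free ~$((3725 - ALLOC)) GiB"
# prints: allocated ~2567 GiB, free ~1158 GiB
```

1158 GiB is about 1.13 TiB, matching the VFree column above.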

Unreplicated volumes

  1. Run:
    mkdir /vol/rep-no
    mkfs -t ext4 /dev/vg0/rep-no
    echo "/dev/mapper/vg0-rep--no /vol/rep-no   ext4  noatime,nodiratime 0 2" >> /etc/fstab
    mount -a

Unison-replicated volumes

  1. Run:
    mkdir /vol/rep-unison
    mkfs -t ext4 /dev/vg0/rep-unison
    echo "/dev/mapper/vg0-rep--unison /vol/rep-unison ext4 noatime,nodiratime 0 2" >> /etc/fstab
    mount -a
  2. Nominate one machine as the client and the other as the server.
  3. On the client run:
    ssh-keygen -q -t dsa -N "" -f ~/.ssh/id_dsa_unison
    cat ~/.ssh/id_dsa_unison.pub
  4. On the server, append the key just displayed to ~/.ssh/authorized_keys.
  5. On the client run the following command and validate the output:
    fiori# ssh -q -i ~/.ssh/id_dsa_unison -o StrictHostKeyChecking=no <server> echo OK
    OK
    fiori#
  6. On the client, create ~/.unison/vol-rep-unison.prf containing:
    root = /vol/rep-unison/
    root = ssh://<server>//vol/rep-unison/
    sshargs = -i /root/.ssh/id_dsa_unison
    owner = true
    group = true
    perms = -1
    times = true
  7. Run the first sync manually by running:
    unison vol-rep-unison

    and follow the prompts.

  8. To automate the synchronisation:
    1. Edit ~/.unison/vol-rep-unison.prf and add the following:
      auto = true
      batch = true
      terse = true
      confirmbigdel = true
      fastcheck = true
      prefer = newer
    2. Run:
      echo '0 0 * * * root unison vol-rep-unison' > /etc/cron.d/unison
  9. To rotate logs:
    1. Edit ~/.unison/vol-rep-unison.prf and add:
      logfile = /var/log/unison.log
    2. Run:
      echo -e '/var/log/unison.log {\n        daily\n        compress\n        rotate 28\n}' > /etc/logrotate.d/unison
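Since lost backslashes in an echo -e string are easy to miss, the stanza can be written to a scratch file and inspected before installing it (printf avoids echo -e portability differences between shells); this is purely a checking aid:

```shell
# Write the logrotate stanza to a scratch file and inspect it before
# installing it as /etc/logrotate.d/unison.
TMP=$(mktemp)
printf '/var/log/unison.log {\n\tdaily\n\tcompress\n\trotate 28\n}\n' > "$TMP"
cat "$TMP"
rm -f "$TMP"
```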

DRBD

If both nodes are available at the same time, then the syncing time can be considerably reduced. If this is not possible, it is still possible to make a “one-legged” cluster and introduce the “second leg” later, but then a full sync will run and must complete before the volume is available on the second node.

  1. On both systems, edit /etc/drbd.d/drbd0.res to contain:
    resource drbd0 {
      protocol  C;
      device    /dev/drbd0 minor 0;
      meta-disk internal;
      disk      /dev/mapper/vg0-rep--drbd0;
      net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
      }
      on fiori {
        address 192.168.1.6:7790;
      }
      on torchio {
        address 192.168.1.7:7790;
      }
      handlers {
        before-resync-target "/sbin/drbdsetup $DRBD_MINOR secondary";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
      }
    }
  2. If both systems’ empty LVs for DRBD are available at the same time, then the initial DRBD synchronisation can be avoided by configuring the volumes according to the following sub-procedure:
    1. On both nodes run:
      dd if=/dev/zero of=/dev/mapper/vg0-rep--drbd0 bs=64M
      drbdadm -- --force create-md drbd0
      drbdadm up drbd0
    2. On one node only run:
      drbdadm -- --clear-bitmap new-current-uuid drbd0
    3. On both nodes run:
      drbdadm primary drbd0
  3. If only one node is initially available then a one-legged-and-then-later-two-legged configuration can be done according to the following sub-procedure:
    1. On the “first leg” (i.e. the currently available machine), run:
      drbdadm create-md drbd0
      drbdadm up drbd0
      drbdadm -- --overwrite-data-of-peer primary drbd0
    2. Verify that the device is now primary:
      fiori# cat /proc/drbd
        0:drbd0  WFConnection Primary/Unknown UpToDate/DUnknown C r----s
      fiori#
    3. You may start OCFS2 configuration (see below); once the “second leg” becomes available then you can continue to the next step of this section.
    4. On the “second leg” run:
      drbdadm create-md drbd0
      drbdadm up drbd0
    5. Wait for synchronisation to complete; you can monitor it by running:
      cat /proc/drbd
    6. Run:
      drbdadm primary drbd0
  4. Repeat the above procedure for drbd1, where /etc/drbd.d/drbd1.res will contain:
    resource drbd1 {
      protocol  C;
      device    /dev/drbd1 minor 1;
      meta-disk internal;
      disk      /dev/mapper/vg0-rep--drbd1;
      net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
      }
      on fiori {
        address 192.168.1.6:7791;
      }
      on torchio {
        address 192.168.1.7:7791;
      }
      handlers {
        before-resync-target "/sbin/drbdsetup $DRBD_MINOR secondary";
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
      }
    }
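Since the two resource files differ only in the resource/device name, the minor number and the port, drbd1.res can be derived mechanically from drbd0.res; a sketch (the helper name is mine; on a node it would be used as `derive_drbd1 < /etc/drbd.d/drbd0.res > /etc/drbd.d/drbd1.res`):

```shell
# Rewrite the drbd0-specific tokens: resource/device name, minor, port.
derive_drbd1() {
    sed -e 's/drbd0/drbd1/g' -e 's/minor 0/minor 1/' -e 's/:7790/:7791/'
}

echo '  device    /dev/drbd0 minor 0;' | derive_drbd1
# prints:   device    /dev/drbd1 minor 1;
```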

OCFS2 cluster

  1. Create /etc/ocfs2/cluster.conf, containing:
    node:
            ip_port = 7777
            ip_address = 192.168.1.6
            number = 1
            name = fiori
            cluster = ocfs2
    
    node:
            ip_port = 7777
            ip_address = 192.168.1.7
            number = 2
            name = torchio
            cluster = ocfs2
    
    cluster:
            node_count = 2
            name = ocfs2
  2. Run:
    dpkg-reconfigure ocfs2-tools

    and when prompted

    Would you like to start an OCFS2 cluster (O2CB) at boot time?

    select Yes and for all following questions accept the default.

  3. Run:
    service o2cb start
    service o2cb online
  4. Repeat this procedure on the other node.
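Before starting o2cb it is worth sanity-checking cluster.conf; a minimal sketch (the helper name is mine) that checks the `node:` stanza count against the declared `node_count`, demonstrated here on a scratch copy:

```shell
# Check that the number of `node:` stanzas equals the declared node_count.
check_cluster_conf() {
    local N C
    N=$(grep -c '^node:' "$1")
    C=$(sed -n 's/^[[:space:]]*node_count = //p' "$1")
    if [ "$N" = "$C" ]; then echo consistent; else echo "mismatch: $N stanzas, node_count=$C"; fi
}

# Demo on a scratch copy; on the nodes, run it against /etc/ocfs2/cluster.conf.
TMP=$(mktemp)
printf 'node:\n\tname = fiori\nnode:\n\tname = torchio\ncluster:\n\tnode_count = 2\n' > "$TMP"
check_cluster_conf "$TMP"    # prints: consistent
rm -f "$TMP"
```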

OCFS2 volumes

  1. Run:
    MODE=mail                   #  For many small files
    MNTPNT=/vol/ocfs2-small     #  For many small files
    DEVICE=/dev/drbd0           #  For many small files
  2. On one node only, run:
    mkfs.ocfs2 --fs-feature-level=max-features -T $MODE $DEVICE
  3. On both nodes run:
    mkdir $MNTPNT
    echo "$DEVICE $MNTPNT ocfs2  _netdev,noatime,data=writeback,commit=60,nodiratime 0 1" >> /etc/fstab
    mount -a
  4. Run:
    MODE=vmstore                #  For VM images
    MNTPNT=/vol/ocfs2-large     #  For VM images
    DEVICE=/dev/drbd1           #  For VM images
  5. On one node only, run:
    mkfs.ocfs2 --fs-feature-level=max-features -T $MODE $DEVICE
  6. On both nodes run:
    mkdir $MNTPNT
    echo "$DEVICE $MNTPNT ocfs2  _netdev,noatime,data=writeback,commit=60,nodiratime 0 1" >> /etc/fstab
    mount -a

Public mountpoints and NFS

Different steps in this procedure are to be run on different machines; see below for more details.

  1. On both nodes of the storage cluster, run:
    mkdir /vol/rep-no/scratch
    ln -s /vol/rep-no/scratch /vol/scratch
    mkdir /vol/rep-no/vmpool0
    ln -s /vol/rep-no/vmpool0 /vol/vmpool0
    mkdir /vol/rep-unison/pub
    ln -s /vol/rep-unison/pub /vol/pub
    mkdir /vol/ocfs2-small/home
    ln -s /vol/ocfs2-small/home /vol/home
    mkdir /vol/ocfs2-small/mail
    ln -s /vol/ocfs2-small/mail /vol/mail
    mkdir /vol/ocfs2-small/wikis
    ln -s /vol/ocfs2-small/wikis /vol/wikis
    mkdir /vol/ocfs2-small/svn
    ln -s /vol/ocfs2-small/svn /vol/svn
    mkdir /vol/ocfs2-large/vmpool1
    ln -s /vol/ocfs2-large/vmpool1 /vol/vmpool1
    mount -a
  2. On both nodes of the storage cluster, run:
    echo "/vol/pub           *(rw,no_root_squash,sync,subtree_check)" >> /etc/exports
    echo "/vol/home          *(rw,no_root_squash,sync,subtree_check)" >> /etc/exports
    echo "/vol/mail          *(rw,no_root_squash,sync,subtree_check)" >> /etc/exports
    echo "/vol/wikis         *(rw,no_root_squash,sync,subtree_check)" >> /etc/exports
    echo "/vol/svn           *(rw,no_root_squash,sync,subtree_check)" >> /etc/exports
  3. On both nodes of the storage cluster, run:
    service nfs-kernel-server start

    (This is needed because, as the exports file was empty when nfs-kernel-server was installed, nfs-kernel-server did not start. Ordinarily it is enough to run exportfs -av.)

  4. On the NIS server, add entries like the following entry to /etc/auto.home:
    alexis -noquota,noatime,nodiratime,nfsvers=3 fiori.pasta.net,torchio.pasta.net:/vol/home/alexis
    ...

    and add entries like the following entry to /etc/auto.staging:

    pub -noquota,noatime,nodiratime,nfsvers=3 fiori.pasta.net,torchio.pasta.net:/vol/pub

    and then run:

    make -C /var/yp
  5. On all NFS clients run:
    mkdir /staging
    ln -s staging/pub /pub
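The export lines in step 2 above differ only in the path, so they can equally be generated, with the output inspected and then appended to /etc/exports:

```shell
# Emit one export line per share, all with the same option string.
for SHARE in pub home mail wikis svn; do
    printf '/vol/%-14s*(rw,no_root_squash,sync,subtree_check)\n' "$SHARE"
done
```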

Tuning

  1. As a workaround for BTS#666021, run:
    sysctl -w vm.min_free_kbytes=65536
    echo "vm.min_free_kbytes=65536" >> /etc/sysctl.conf
  2. As a workaround for the latency introduced by Powerline equipment:
    1. Add the following to /etc/drbd.d/global_common.conf:
      common {
          ...
          startup {
              ...
              wfc-timeout 120;
          }
          ...
          net {
              ...
              ping-int 30;
              ping-timeout 60;
          }
          ...
      }
    2. Change the timeouts on /etc/default/o2cb to be:
      O2CB_ENABLED=true
      O2CB_BOOTCLUSTER=ocfs2
      O2CB_HEARTBEAT_THRESHOLD=61
      O2CB_IDLE_TIMEOUT_MS=60000
      O2CB_KEEPALIVE_DELAY_MS=4000
      O2CB_RECONNECT_DELAY_MS=4000
  3. Due to BTS#476545, it is necessary to move all OCFS2-related mounts from being done by /etc/init.d/mountnfs.sh to elsewhere. At first I tried putting it all in /etc/rc.local but there were inconsistencies in the availability of resources (DRBD modes, etc.) that made this too inflexible. In the end I used a script:
    #!/bin/bash
    set -e
    PROGNAME=$(basename $0)
    
    #  Configuration
    RESOURCES="drbd0 drbd1"
    
    #  Globals
    OCFS2_MOUNT_OPTS="_netdev,noauto,noatime,data=writeback,commit=60,nodiratime"
    
    #  Message functions
    info()    { echo "$PROGNAME: INFO: $1"    >&2;         }
    warning() { echo "$PROGNAME: WARNING: $1" >&2;         }
    error()   { echo "$PROGNAME: ERROR: $1"   >&2; exit 1; }
    
    pool_status() { local P S; P=$1; S=$(virsh pool-list | grep $P | awk '{ print $2 }'); info "pool $P has status $S"; echo "$S"; }
    
    #  rep-unison
    if true; then
       info "rep-unison ..."
       exportfs -a 2>/dev/null || true
    fi
    
    #  rep-no
    if true; then
       info "rep-no ..."
       [ "X$(pool_status vmpool0)" = Xactive ] || virsh pool-start vmpool0
    fi
    
    #  drbd - promote
    if true; then
       info "drbdX (promote) ..."
       for RESOURCE in $RESOURCES; do
           #  Promote?
           ROLE=$(drbdadm role $RESOURCE | sed 's@/.*@@')
           [ $ROLE = Primary ] || drbdadm primary $RESOURCE
           #  Check unmounted.
           MOUNTED_FLAG=$(awk '{ print $1 }' /proc/mounts | fgrep -qx /dev/$RESOURCE && echo true || echo false)
           ! $MOUNTED_FLAG || error "/dev/$RESOURCE: unexpectedly already mounted"
           #  Don't fsck, because it may be mounted on the other node.
           ##  Fsck and mount.
           #fsck -f /dev/$RESOURCE
       done
    fi
    
    #  drbd - mount
    if true; then
       info "drbdX (mount) ..."
       for RESOURCE in $RESOURCES; do
           #  Which mountpoint?
           case $RESOURCE in
               drbd0) MNTPNT=/vol/ocfs2-small ;;
               drbd1) MNTPNT=/vol/ocfs2-large ;;
           esac
           mount -o $OCFS2_MOUNT_OPTS /dev/$RESOURCE $MNTPNT
       done
    fi
    
    #  drbd0
    if true; then
       info "drbd0 (specific) ..."
       exportfs -a 2>/dev/null || true
    fi
    
    #  drbd1
    if true; then
       info "drbd1 (specific) ..."
       [ "X$(pool_status vmpool1)" = Xactive ] || virsh pool-start vmpool1
    fi
    
    #  This export should work!
    exportfs -av
  4. To make kernel panics trigger reboots, run:
    echo "kernel.panic = 10" >> /etc/sysctl.conf
    sysctl -w kernel.panic=10
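On the choice of O2CB_HEARTBEAT_THRESHOLD=61 in the Powerline workaround above: OCFS2's disk heartbeat fires every 2 seconds and a node is declared dead after (threshold - 1) × 2 seconds, so 61 yields a 120 s window, matching DRBD's wfc-timeout of 120:

```shell
# Dead-node window implied by the heartbeat threshold (2 s per heartbeat).
THRESHOLD=61
echo "dead time: $(( (THRESHOLD - 1) * 2 )) s"    # prints: dead time: 120 s
```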

Switching to a back-to-back link for DRBD and OCFS2 traffic

With the insertion of a cheap gigabit NIC in both fiori and torchio and a network cable between them, DRBD reliability is much greater than when traffic is routed through the Fritz!Box.

  1. On both nodes, edit /etc/network/interfaces and add a suitable stanza for new interface eth1.
  2. On both nodes, bring up the new interfaces.
  3. On one node, verify that the other node’s eth1 interface is pingable.
  4. On both nodes, stop whatever is necessary in order to unmount OCFS2 filesystems.
  5. On both nodes, unmount OCFS2 filesystems.
  6. On both nodes, down all DRBD devices.
  7. On both nodes, modify the IP addresses referenced in /etc/drbd.d/*.res to use the back-to-back link instead.
  8. On both nodes, up all DRBD devices and make them primary.
  9. On both nodes, modify the IP addresses referenced in /etc/ocfs2/cluster.conf to use the back-to-back link instead.
  10. On both nodes, mount OCFS2 filesystems.
  11. On both nodes, edit /etc/yp.conf and change this:
    domain pasta.net broadcast

    to this:

    domain pasta.net server localhost

    (This is needed because the IP address of the other node’s eth1 interface may reply before localhost does! Yes, really!)

See also