Cgroup procedures

Introduction

This page logs various procedures I read or wrote while learning about cgroups.

Cgroups is a kernel facility providing subsystems that control the access a user-defined group of processes has to a resource; e.g. the “memory” subsystem controls the amount of memory that a group of processes may use.

Cgroups subsystems are managed via a virtual filesystem (like /proc). Look at this example:

banana# mount | grep 'type cgroup'
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,clone_children)
banana#

Note:

  • the resource-controlling subsystems are specified as mount options, not as the device being mounted; the device being mounted is always ‘cgroup’
  • each subsystem can be attached to, and so managed through, only one mounted hierarchy (mountpoint) at a time
  • multiple subsystems may be managed through one mountpoint (e.g. cpu and cpuacct are both managed via the /sys/fs/cgroup/cpu,cpuacct mountpoint), but each subsystem still appears at most once across the complete list of mount options
  • by convention, the mountpoint name is derived from the list of subsystems that it is managing
  • to allow a little more flexibility, subdirectories may be created under each mountpoint (the control files in the mountpoint are automatically created in each subdirectory); this does not change the subsystems exposed to a group of processes, but it does allow different limits to be applied to different groups of processes
  • the top-level cgroup under a mount contains all processes – this is the ‘root cgroup’; cgroups further down the hierarchy contain subsets of the processes in their parent cgroup (see the sketch after this list)
  • depending on your distribution and what you have installed, if you run the above command, you may see similar, different or empty output
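
A minimal sketch of the last two notes, assuming the freezer subsystem is mounted at /sys/fs/cgroup/freezer as in the output above (the ‘example’ cgroup name is made up for illustration):

    #  The root cgroup of a mounted hierarchy holds every process on the system
    wc -l < /sys/fs/cgroup/freezer/cgroup.procs

    #  A new child cgroup gets the control files automatically but starts with no members
    mkdir /sys/fs/cgroup/freezer/example
    ls /sys/fs/cgroup/freezer/example
    wc -l < /sys/fs/cgroup/freezer/example/cgroup.procs   # 0
    rmdir /sys/fs/cgroup/freezer/example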

Packages for managing cgroups include:

  • cgroup-lite (simple package for Debian/Ubuntu to provide per-subsystem mountpoints at boot-time)

Procedures

Procedure: prologue

  1. Ensure that the kernel command line includes:
    swapaccount=1

    (Without this, the memory+swap limits (memory.memsw.*) used below will not work.)

  2. See what cgroup subsystems are provided by your kernel (a sample of the output format follows this procedure):
    cat /proc/cgroups
  3. Depending on what cgroup management tools are already installed on your system, you may or may not already see cgroup filesystems mounted under /sys/fs/cgroup (compare the mount output above).
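
For reference (step 2), the output of /proc/cgroups is a small table of four columns; the rows and values below are illustrative only and will differ per kernel:

    #subsys_name    hierarchy   num_cgroups enabled
    cpuset          1           1           1
    cpu             2           1           1
    cpuacct         2           1           1
    memory          3           1           1
    freezer         4           1           1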

Procedure: limiting resources of recently started processes

This procedure is not reboot-safe; a reboot-safe variant may be added here later.

  1. Select a name for the group of processes you will be controlling (e.g. “ssh-procs”, “procs-run-as-user-fred”, “my-test-procs”).
  2. Select the resources (e.g. CPU, memory) whose usage by that group you want to limit, together with the cgroup subsystems that control them (e.g. cpu, cpuacct, memory).
  3. Create a mountpoint for the set of subsystems and mount the set:
    mkdir /sys/fs/cgroup/devices,memory,cpu,cpuacct
    mount -t cgroup -o devices,memory,cpu,cpuacct cgroup /sys/fs/cgroup/devices,memory,cpu,cpuacct
  4. Create a subdirectory to allow specific limits to be applied to a specific set of processes (the cgroup!) for the resources:
    mkdir /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs
    
  5. If you do not have a suitable process to manipulate, then compile this test program, which is referred to in the output below as a.out:
    #include <stdlib.h>   /* for malloc(), atof(), atoi() */
    #include <stdio.h>    /* for printf(), setbuf() */
    #include <errno.h>    /* for errno */
    #include <error.h>    /* for error() */
    #include <strings.h>  /* for rindex() */
    #include <time.h>     /* for nanosleep() */
    
    #define usage() error(1,0,"Usage: %s <interval> <chunk-size> <progress-char>", argv[0])
    
    char *progname;
    
    int main(
    int argc,
    char *argv[])
    {
        int i, j, chunksize;
        float interval;
        char *progress_string;
        struct timespec interval_timespec;
        void *mp;
    
        /* process arguments; progname = basename of argv[0] (rindex() returns NULL if there is no '/') */
        progname = rindex(argv[0], '/');
        progname = (progname != NULL) ? progname + 1 : argv[0];
        if (argc != 4)
                usage();
        errno = 0;
        interval = atof(argv[1]);
        interval_timespec.tv_sec = (time_t) interval;
        interval_timespec.tv_nsec = (interval - (int) interval) * 1000000000;
        if (errno)
            usage();
        chunksize = atoi(argv[2]);
        if (errno)
            usage();
        if (chunksize < 0)
            usage();
        progress_string = argv[3];
    
        /* don't buffer stdout */
        setbuf(stdout, NULL);
    
        /* loop */
        for (i=0; ; ) {
            /* optimisation: don't call malloc() if malloc()ing zero bytes */
            if (chunksize > 0 && (mp=malloc(chunksize)) == NULL)
                error(1,0,"%s%s: internal error; malloc() failed\n", progname, progress_string);
            /* populate as Linux's malloc() is opportunistic */
            for (j=0; j<chunksize; j++)
                ((unsigned char *)mp)[j] = (unsigned char)(j % 256);
            i++;
            printf("%s", progress_string);
            /* optimisation: don't call (very expensive) nanosleep() if nanosleep()ing zero */
            if (interval > 0.0)
                nanosleep(&interval_timespec, NULL);
        }
    }
  6. Run the test program, assign it to the cgroup and then apply a limit of 100MB memory:
    #  Every 1s get ~5MB memory, print '+' after each, bg it so pid in $!
    ./a.out 1 5000000 + &
    echo $! > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cgroup.procs
    
    #  Set memory limit (order important)
    echo 100M > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.limit_in_bytes
    echo 100M > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.memsw.limit_in_bytes
    
    #  Wait for the shell to report that a.out was killed (SIGKILL, i.e. untrappable); see the monitoring sketch after this procedure
    ...
    
    #  Clear memory limit (order important)
    echo -1 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.memsw.limit_in_bytes
    echo -1 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.limit_in_bytes

    Note that:

    • you cannot set memory.memsw.limit_in_bytes without setting memory.limit_in_bytes first. If you try, you get:
      -bash: echo: write error: Invalid argument
    • you cannot set a limit after a process has already exceeded that limit. If you try, you get:
      -bash: echo: write error: Device or resource busy
    • an explanation of the tunable parameters can be found in the kernel source’s Documentation/cgroups/memory.txt
  7. Run the test program, assign it to the cgroup and then apply a limit of 20% CPU:
    #  Every 0s get 0 bytes memory, print nothing after each, bg it so pid in $!
    ./a.out 0 0 '' &
    echo $! >/sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cgroup.procs
    
    #  Note run interval (every 100ms) and allowed runtime per interval (unrestricted)
    cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_period_us
    100000
    cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_quota_us
    -1
    
    #  See it using 100% CPU
    top -n 1 -b | fgrep a.out
     5275 alexis    20   0    4188    356    276 R 99.3  0.0   7:08.11 a.out
    
    #  Calculate 20% of run interval and set that as allowed runtime per interval
    echo 20000 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_quota_us
    
    #  See it using less than 20% CPU
    top -n 1 -b | fgrep a.out
     5275 alexis    20   0    4188    356    276 R 13.2  0.0   7:03.22 a.out
    
    #  Clear CPU limit
    echo -1 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_quota_us
    
    #  See it using 100% CPU again
    top -n 1 -b | fgrep a.out
     5275 alexis    20   0    4188    356    276 R 99.3  0.0   7:07.17 a.out

    Note that:

    • an explanation of the tunable parameters can be found in the kernel source’s Documentation/scheduler/sched-bwc.txt
    • in the above two steps (for memory and CPU control respectively), multiple subsystems were managed via one mountpoint; in practice this makes cgroup administration awkward, so it is better to use one mountpoint per subsystem (see the procedure below).
  8. Run the test program, assign it to the cgroup and then “freeze” (i.e. suspend) and “thaw” (i.e. allow to run) it:
    #  Every 1s get 0 bytes memory, print '+' after each, bg it so pid in $!
    ./a.out 1 0 '+' &
    mkdir /sys/fs/cgroup/freezer/my-test-procs
    echo $! > /sys/fs/cgroup/freezer/my-test-procs/cgroup.procs
    #  Note how process is currently thawed
    cat /sys/fs/cgroup/freezer/my-test-procs/freezer.state
    THAWED
    
    #  Freeze the process
    echo FROZEN > /sys/fs/cgroup/freezer/my-test-procs/freezer.state
    
    #  Check that the process is frozen (e.g. its '+' output stops; see the sketch after this procedure)
    ...
    
    #  Thaw the process
    echo THAWED > /sys/fs/cgroup/freezer/my-test-procs/freezer.state
    
    #  Check process has thawed
    ...

    Note that:

    • the freezer subsystem was already mounted (probably triggered by installing cgroups-bin), so I had to use that existing mountpoint rather than preparing the mount myself; hence the path differs from the previous steps
    • the shell does not detect that the process has been frozen (freezing doesn’t use signals)
    • there are no tunable parameters for this subsystem
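
Two optional checks for the procedure above. Both are sketches against the paths created earlier; memory.usage_in_bytes and memory.max_usage_in_bytes are standard files of the v1 memory subsystem:

    #  Step 6: watch current and peak usage climb towards the limit
    cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.usage_in_bytes
    cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.max_usage_in_bytes

    #  Step 8: besides the '+' output stopping, freezer.state reads FREEZING
    #  while the freeze is still in progress, then FROZEN
    cat /sys/fs/cgroup/freezer/my-test-procs/freezer.state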

Procedure: switching to one mountpoint per subsystem

  1. Debian/Ubuntu: install the cgroup-lite package and reboot (a quick verification follows).
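
After the reboot, each subsystem should appear on its own mountpoint; a quick check (the exact list depends on which subsystems your kernel provides):

    #  Print just the mountpoints of all mounted cgroup filesystems
    mount -t cgroup | awk '{print $3}'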

Procedure: verifying that a process can only be a member of one cgroup per hierarchy

  1. Run
    #  We need a 'cat' command that does not create a new process (one would perturb the pid listings below), so implement it as a shell function
    cat() { while read LINE; do echo "$LINE"; done; }
    
    #  Start a process
    sleep 1000 &
    PID=$!
    echo $PID
    #  Record the pids in the root group
    cat < /sys/fs/cgroup/freezer/cgroup.procs > before.pids
    
    #  Add the new process to a (non-root) cgroup
    echo $PID > /sys/fs/cgroup/freezer/my-test-procs/cgroup.procs
    
    #  Rerecord the pids in the root cgroup
    cat < /sys/fs/cgroup/freezer/cgroup.procs > after.pids
    
    #  Look! The new pid was *moved* from one cgroup to another, not copied
    diff before.pids after.pids
    70d69
    < 5849
  2. We can repeat this with a new cgroup nested inside the first:
    #  Create a sub-cgroup
    mkdir /sys/fs/cgroup/freezer/my-test-procs/my-sub-test-procs
    
    #  Put the process in it
    echo $PID > /sys/fs/cgroup/freezer/my-test-procs/my-sub-test-procs/cgroup.procs
    
    #  The root cgroup no longer contained the pid, so it is unaffected
    cat < /sys/fs/cgroup/freezer/cgroup.procs > after2.pids
    diff after.pids after2.pids
    
    #  But look, it has been moved from my-test-procs to my-test-procs/my-sub-test-procs
    cat < /sys/fs/cgroup/freezer/my-test-procs/cgroup.procs
    cat < /sys/fs/cgroup/freezer/my-test-procs/my-sub-test-procs/cgroup.procs
    5849
  3. Read the kernel’s cgroup documentation for the rules behind this behaviour. A per-process view of the same information is sketched below.
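
Membership can also be checked from the process’s side: /proc/<pid>/cgroup lists, one line per mounted hierarchy, the cgroup that the process belongs to (the hierarchy number below is illustrative):

    grep freezer /proc/$PID/cgroup
    5:freezer:/my-test-procs/my-sub-test-procs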

Procedure: enforcing resource limits

  1. Verify that the program cgrulesengd is installed; install if necessary.
  2. Edit /etc/cgrules.conf (it may not exist already) and add the following:
    *:a.out        cpu      /my-test-group
  3. Run:
    cgrulesengd -d
    CGroup Rules Engine Daemon log started
    Current time: Fri Jun 12 10:55:39 2015
    
    Opened log file: -, log facility: 0, log level: 7
    Proceeding with PID 1121
    Rule: *:a.out
      UID: any
      GID: any
      DEST: /my-test-group
      CONTROLLERS:
        cpu
    
    Started the CGroup Rules Engine Daemon.
  4. Now run a.out as follows:
    ./a.out 0 0 '' &
  5. Use top to monitor its CPU consumption, which should be around 20%, assuming a 20% CPU quota has previously been applied to /my-test-group 🙂
  6. Note:
    • cgrulesengd does not start automatically; an init.d script or systemd unit is needed (a sketch follows).
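
A minimal sketch of such a systemd unit; the unit name, the path to cgrulesengd and the -n (do not fork) flag are assumptions to verify against your installation:

    # /etc/systemd/system/cgrulesengd.service (hypothetical)
    [Unit]
    Description=CGroup Rules Engine Daemon
    After=local-fs.target

    [Service]
    # Run in the foreground so that systemd can supervise the process directly
    ExecStart=/usr/sbin/cgrulesengd -n
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Enable it with “systemctl enable cgrulesengd; systemctl start cgrulesengd”.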

Questions

  1. Why does ‘mount’ output not show all cgroup mountpoints? Reading /proc/mounts does show them all (a possible explanation follows the output). E.g.:
    test01# mount | grep cgroup/
    systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
    cgroup on /sys/fs/cgroup/devices,memory,cpu,cpuacct type cgroup (rw,devices,memory,cpu,cpuacct)
    test01# grep cgroup/ /proc/mounts
    systemd /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,name=systemd 0 0
    cgroup /sys/fs/cgroup/devices,memory,cpu,cpuacct cgroup rw,relatime,devices,memory,cpuacct,cpu 0 0
    cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
    cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
    cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
    cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event 0 0
    cgroup /sys/fs/cgroup/hugetlb cgroup rw,relatime,hugetlb 0 0
    test01#
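
A plausible explanation (not verified here): mount(8) reports the contents of /etc/mtab, which goes stale when it is a regular file rather than a symlink to /proc/mounts, so mounts made without updating mtab (e.g. by early boot scripts) never show up; /proc/mounts is maintained by the kernel and is always complete. A quick check:

    ls -l /etc/mtab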

See also