Cgroup procedures

Introduction

This page logs various procedures I read or wrote while learning about cgroups.

Cgroups is a kernel facility providing subsystems that control the access a user-defined group of processes has to a resource; e.g. the “memory” subsystem controls the amount of memory that a group of processes may use.

Cgroups subsystems are managed via a virtual filesystem (like /proc). Look at this example:

banana# mount | grep 'type cgroup'
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,clone_children)
banana#

Note:

  • the resource-controlling subsystems are specified as mount options, not as the device being mounted; the device being mounted is always ‘cgroup’
  • each subsystem can be attached to, and so managed through, only one mounted hierarchy (mountpoint) at a time
  • multiple subsystems may be managed through one mountpoint (e.g. cpu and cpuacct are both managed via the /sys/fs/cgroup/cpu,cpuacct mountpoint), but each subsystem still appears at most once across the complete list of mount options
  • by convention, the mountpoint name is derived from the list of subsystems that it is managing
  • to allow a little more flexibility, subdirectories may be created under each mountpoint (the control files in the mountpoint are automatically created in each subdirectory); this does not change the subsystems exposed to a group of processes, but it does allow different limits to be applied to different groups of processes
  • the top-level cgroup under a mount contains all processes – this is the ‘root cgroup’; cgroups further down the hierarchy contain subsets of the processes in their parent cgroup (see the sketch after this list)
  • depending on your distribution and what you have installed, if you run the above command, you may see similar, different or empty output
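
A minimal sketch of the last two notes, assuming the freezer subsystem is mounted at /sys/fs/cgroup/freezer as in the output above (the ‘example’ cgroup name is made up for illustration):

    #  The root cgroup of a mounted hierarchy holds every process on the system
    wc -l < /sys/fs/cgroup/freezer/cgroup.procs

    #  A new child cgroup gets the control files automatically but starts with no members
    mkdir /sys/fs/cgroup/freezer/example
    ls /sys/fs/cgroup/freezer/example
    wc -l < /sys/fs/cgroup/freezer/example/cgroup.procs   # 0
    rmdir /sys/fs/cgroup/freezer/example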

Packages for managing cgroups include:

  • cgroup-lite (simple package for Debian/Ubuntu to provide per-subsystem mountpoints at boot-time)

Procedures

Procedure: prologue

  1. Ensure that the kernel command line includes:
    swapaccount=1

    (Without this, the memory+swap limits (memory.memsw.*) used below will not work.)

  2. See what cgroup subsystems are provided by your kernel (a sample of the output format follows this procedure):
    cat /proc/cgroups
  3. Depending on what cgroup management tools are already installed on your system, you may or may not already see cgroup filesystems mounted under /sys/fs/cgroup (compare the mount output above).
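
For reference (step 2), the output of /proc/cgroups is a small table of four columns; the rows and values below are illustrative only and will differ per kernel:

    #subsys_name    hierarchy   num_cgroups enabled
    cpuset          1           1           1
    cpu             2           1           1
    cpuacct         2           1           1
    memory          3           1           1
    freezer         4           1           1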

Procedure: limiting resources of recently started processes

This procedure is not reboot-safe; a reboot-safe variant may be added here later.

  1. Select a name for the group of processes you will be controlling (e.g. “ssh-procs”, “procs-run-as-user-fred”, “my-test-procs”).
  2. Select the resources (e.g. CPU, memory) whose usage by that group you want to limit, together with the cgroup subsystems that control them (e.g. cpu, cpuacct, memory).
  3. Create a mountpoint for the set of subsystems and mount the set:
    mkdir /sys/fs/cgroup/devices,memory,cpu,cpuacct
    mount -t cgroup -o devices,memory,cpu,cpuacct cgroup /sys/fs/cgroup/devices,memory,cpu,cpuacct
  4. Create a subdirectory to allow specific limits to be applied to a specific set of processes (the cgroup!) for the resources:
    mkdir /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs
    
  5. If you do not have a suitable process to manipulate, then compile this test program, which is referred to in the output below as a.out:
    #include <stdlib.h>   /* for malloc(), atof(), atoi() */
    #include <stdio.h>    /* for printf(), setbuf() */
    #include <errno.h>    /* for errno */
    #include <error.h>    /* for error() */
    #include <strings.h>  /* for rindex() */
    #include <time.h>     /* for nanosleep() */
    
    #define usage() error(1,0,"Usage: %s <interval> <chunk-size> <progress-char>", argv[0])
    
    char *progname;
    
    int main(
    int argc,
    char *argv[])
    {
        int i, j, chunksize;
        float interval;
        char *progress_string;
        struct timespec interval_timespec;
        void *mp;
    
        /* process arguments; progname = basename of argv[0] (rindex() returns NULL if there is no '/') */
        progname = rindex(argv[0], '/');
        progname = (progname != NULL) ? progname + 1 : argv[0];
        if (argc != 4)
                usage();
        errno = 0;
        interval = atof(argv[1]);
        interval_timespec.tv_sec = (time_t) interval;
        interval_timespec.tv_nsec = (interval - (int) interval) * 1000000000;
        if (errno)
            usage();
        chunksize = atoi(argv[2]);
        if (errno)
            usage();
        if (chunksize < 0)
            usage();
        progress_string = argv[3];
    
        /* don't buffer stdout */
        setbuf(stdout, NULL);
    
        /* loop */
        for (i=0; ; ) {
            /* optimisation: don't call malloc() if malloc()ing zero bytes */
            if (chunksize > 0 && (mp=malloc(chunksize)) == NULL)
                error(1,0,"%s%s: internal error; malloc() failed\n", progname, progress_string);
            /* populate as Linux's malloc() is opportunistic */
            for (j=0; j<chunksize; j++)
                ((unsigned char *)mp)[j] = (unsigned char)(j % 256);
            i++;
            printf("%s", progress_string);
            /* optimisation: don't call (very expensive) nanosleep() if nanosleep()ing zero */
            if (interval > 0.0)
                nanosleep(&interval_timespec, NULL);
        }
    }
  6. Run the test program, assign it to the cgroup and then apply a limit of 100MB memory:
    #  Every 1s get ~5MB memory, print '+' after each, bg it so pid in $!
    ./a.out 1 5000000 + &
    echo $! > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cgroup.procs
    
    #  Set memory limit (order important)
    echo 100M > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.limit_in_bytes
    echo 100M > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.memsw.limit_in_bytes
    
    #  Wait for the shell to report that a.out was killed (SIGKILL, i.e. untrappable); see the monitoring sketch after this procedure
    ...
    
    #  Clear memory limit (order important)
    echo -1 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.memsw.limit_in_bytes
    echo -1 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.limit_in_bytes

    Note that:

    • you cannot set memory.memsw.limit_in_bytes without setting memory.limit_in_bytes first. If you try, you get:
      -bash: echo: write error: Invalid argument
    • you cannot set a limit after a process has already exceeded that limit. If you try, you get:
      -bash: echo: write error: Device or resource busy
    • an explanation of the tunable parameters can be found in the kernel source’s Documentation/cgroups/memory.txt
  7. Run the test program, assign it to the cgroup and then apply a limit of 20% CPU:
    #  Every 0s get 0 bytes memory, print nothing after each, bg it so pid in $!
    ./a.out 0 0 '' &
    echo $! >/sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cgroup.procs
    
    #  Note run interval (every 100ms) and allowed runtime per interval (unrestricted)
    cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_period_us
    100000
    cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_quota_us
    -1
    
    #  See it using 100% CPU
    top -n 1 -b | fgrep a.out
     5275 alexis    20   0    4188    356    276 R 99.3  0.0   7:08.11 a.out
    
    #  Calculate 20% of run interval and set that as allowed runtime per interval
    echo 20000 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_quota_us
    
    #  See it using less than 20% CPU
    top -n 1 -b | fgrep a.out
     5275 alexis    20   0    4188    356    276 R 13.2  0.0   7:03.22 a.out
    
    #  Clear CPU limit
    echo -1 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_quota_us
    
    #  See it using 100% CPU again
    top -n 1 -b | fgrep a.out
     5275 alexis    20   0    4188    356    276 R 99.3  0.0   7:07.17 a.out

    Note that:

    • an explanation of the tunable parameters can be found in the kernel source’s Documentation/scheduler/sched-bwc.txt
    • in the above two steps (for memory and CPU control respectively), multiple subsystems were managed via one mountpoint; in practice this makes cgroup administration awkward, so it is better to use one mountpoint per subsystem (see the procedure below).
  8. Run the test program, assign it to the cgroup and then “freeze” (i.e. suspend) and “thaw” (i.e. allow to run) it:
    #  Every 1s get 0 bytes memory, print '+' after each, bg it so pid in $!
    ./a.out 1 0 '+' &
    mkdir /sys/fs/cgroup/freezer/my-test-procs
    echo $! > /sys/fs/cgroup/freezer/my-test-procs/cgroup.procs
    #  Note how process is currently thawed
    cat /sys/fs/cgroup/freezer/my-test-procs/freezer.state
    THAWED
    
    #  Freeze the process
    echo FROZEN > /sys/fs/cgroup/freezer/my-test-procs/freezer.state
    
    #  Check that the process is frozen (e.g. its '+' output stops; see the sketch after this procedure)
    ...
    
    #  Thaw the process
    echo THAWED > /sys/fs/cgroup/freezer/my-test-procs/freezer.state
    
    #  Check process has thawed
    ...

    Note that:

    • the freezer subsystem was already mounted (probably triggered by installing cgroups-bin), so I had to use that existing mountpoint rather than preparing the mount myself; hence the path differs from the previous steps
    • the shell does not detect that the process has been frozen (freezing doesn’t use signals)
    • there are no tunable parameters for this subsystem
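
Two optional checks for the procedure above. Both are sketches against the paths created earlier; memory.usage_in_bytes and memory.max_usage_in_bytes are standard files of the v1 memory subsystem:

    #  Step 6: watch current and peak usage climb towards the limit
    cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.usage_in_bytes
    cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.max_usage_in_bytes

    #  Step 8: besides the '+' output stopping, freezer.state reads FREEZING
    #  while the freeze is still in progress, then FROZEN
    cat /sys/fs/cgroup/freezer/my-test-procs/freezer.state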

Procedure: switching to one mountpoint per subsystem

  1. Debian/Ubuntu: install the cgroup-lite package and reboot (a quick verification follows).
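
After the reboot, each subsystem should appear on its own mountpoint; a quick check (the exact list depends on which subsystems your kernel provides):

    #  Print just the mountpoints of all mounted cgroup filesystems
    mount -t cgroup | awk '{print $3}'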

Procedure: verifying that a process can only be a member of one cgroup per hierarchy

  1. Run
    #  We need a 'cat' command that does not create a new process (one would perturb the pid listings below), so implement it as a shell function
    cat() { while read LINE; do echo "$LINE"; done; }
    
    #  Start a process
    sleep 1000 &
    PID=$!
    echo $PID
    #  Record the pids in the root group
    cat < /sys/fs/cgroup/freezer/cgroup.procs > before.pids
    
    #  Add the new process to a (non-root) cgroup
    echo $PID > /sys/fs/cgroup/freezer/my-test-procs/cgroup.procs
    
    #  Rerecord the pids in the root cgroup
    cat < /sys/fs/cgroup/freezer/cgroup.procs > after.pids
    
    #  Look! The new pid was *moved* from one cgroup to another, not copied
    diff before.pids after.pids
    70d69
    < 5849
  2. We can repeat this with a new cgroup nested inside the first:
    #  Create a sub-cgroup
    mkdir /sys/fs/cgroup/freezer/my-test-procs/my-sub-test-procs
    
    #  Put the process in it
    echo $PID > /sys/fs/cgroup/freezer/my-test-procs/my-sub-test-procs/cgroup.procs
    
    #  The root cgroup no longer contained the pid, so it is unaffected
    cat < /sys/fs/cgroup/freezer/cgroup.procs > after2.pids
    diff after.pids after2.pids
    
    #  But look, it has been moved from my-test-procs to my-test-procs/my-sub-test-procs
    cat < /sys/fs/cgroup/freezer/my-test-procs/cgroup.procs
    cat < /sys/fs/cgroup/freezer/my-test-procs/my-sub-test-procs/cgroup.procs
    5849
  3. Read the kernel’s cgroup documentation for the rules behind this behaviour. A per-process view of the same information is sketched below.
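
Membership can also be checked from the process’s side: /proc/<pid>/cgroup lists, one line per mounted hierarchy, the cgroup that the process belongs to (the hierarchy number below is illustrative):

    grep freezer /proc/$PID/cgroup
    5:freezer:/my-test-procs/my-sub-test-procs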

Procedure: enforcing resource limits

  1. Verify that the program cgrulesengd is installed; install if necessary.
  2. Edit /etc/cgrules.conf (it may not exist already) and add the following:
    *:a.out        cpu      /my-test-group
  3. Run:
    cgrulesengd -d
    CGroup Rules Engine Daemon log started
    Current time: Fri Jun 12 10:55:39 2015
    
    Opened log file: -, log facility: 0, log level: 7
    Proceeding with PID 1121
    Rule: *:a.out
      UID: any
      GID: any
      DEST: /my-test-group
      CONTROLLERS:
        cpu
    
    Started the CGroup Rules Engine Daemon.
  4. Now run a.out as follows:
    ./a.out 0 0 '' &
  5. Use top to monitor its CPU consumption, which should be around 20%, assuming a 20% CPU quota has previously been applied to /my-test-group 🙂
  6. Note:
    • cgrulesengd does not start automatically; an init.d script or systemd unit is needed (a sketch follows).
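
A minimal sketch of such a systemd unit; the unit name, the path to cgrulesengd and the -n (do not fork) flag are assumptions to verify against your installation:

    # /etc/systemd/system/cgrulesengd.service (hypothetical)
    [Unit]
    Description=CGroup Rules Engine Daemon
    After=local-fs.target

    [Service]
    # Run in the foreground so that systemd can supervise the process directly
    ExecStart=/usr/sbin/cgrulesengd -n
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Enable it with “systemctl enable cgrulesengd; systemctl start cgrulesengd”.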

Questions

  1. Why does ‘mount’ output not show all cgroup mountpoints? Reading /proc/mounts does show them all (a possible explanation follows the output). E.g.:
    test01# mount | grep cgroup/
    systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
    cgroup on /sys/fs/cgroup/devices,memory,cpu,cpuacct type cgroup (rw,devices,memory,cpu,cpuacct)
    test01# grep cgroup/ /proc/mounts
    systemd /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,name=systemd 0 0
    cgroup /sys/fs/cgroup/devices,memory,cpu,cpuacct cgroup rw,relatime,devices,memory,cpuacct,cpu 0 0
    cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
    cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
    cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
    cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event 0 0
    cgroup /sys/fs/cgroup/hugetlb cgroup rw,relatime,hugetlb 0 0
    test01#
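
A plausible explanation (not verified here): mount(8) reports the contents of /etc/mtab, which goes stale when it is a regular file rather than a symlink to /proc/mounts, so mounts made without updating mtab (e.g. by early boot scripts) never show up; /proc/mounts is maintained by the kernel and is always complete. A quick check:

    ls -l /etc/mtab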

See also