Introduction
This page logs various procedures I read or wrote while learning about cgroups.
Cgroups is a kernel facility providing subsystems that control the access a user-defined group of processes has to a resource; e.g. the “memory” subsystem controls the amount of memory that a group of processes may use.
Cgroup subsystems are managed via a virtual filesystem (like /proc). For example:
banana# mount | grep 'type cgroup'
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,clone_children)
banana#
Note:
- the resource-controlling subsystems are specified as mount options, not as the device being mounted; the device being mounted is always ‘cgroup’
- each subsystem can only be managed through one mountpoint (according to rules here)
- but multiple subsystems may be managed through one mountpoint (e.g. cpu and cpuacct are both managed via the /sys/fs/cgroup/cpu,cpuacct mountpoint); still, each subsystem appears at most once in the complete list of mount options
- by convention, the mountpoint name is derived from the list of subsystems that it is managing
- to allow a little more flexibility, subdirectories may be created under each mountpoint (with the list of filenames in the mountpoint being automatically propagated into its subdirectories); this does not change the subsystems exposed to a group of processes, but it does allow different limits to be applied to different groups of processes
- the top-level cgroup under a mount covers all processes – this is the ‘root cgroup’; cgroups further down the hierarchy may contain subsets of the processes in their parent cgroup
- depending on your distribution and what you have installed, if you run the above command, you may see similar, different or empty output
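The mountpoint-to-subsystem mapping shown above can also be recovered programmatically from /proc/mounts. A minimal sketch; it is run here against sample lines in /proc/mounts format (so it is self-contained), but on a real system you would feed awk the live file instead of the here-document:

```shell
# Print "mountpoint: subsystems" for each cgroup mount, given
# /proc/mounts-format input.  On a live system, replace the
# here-document with /proc/mounts itself.
map=$(awk '$3 == "cgroup" {
    n = split($4, opts, ",")
    subsys = ""
    for (i = 1; i <= n; i++)
        # skip generic mount options; keep only subsystem names
        if (opts[i] !~ /^(rw|ro|nosuid|nodev|noexec|relatime|xattr|name=.*|release_agent=.*)$/)
            subsys = subsys (subsys == "" ? "" : ",") opts[i]
    print $2 ": " subsys
}' <<'EOF'
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
EOF
)
echo "$map"
```

The list of "generic" options filtered out is an assumption; any option that is not a subsystem name needs to be on that list.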
Packages for managing cgroups include:
- cgroup-lite (simple package for Debian/Ubuntu to provide per-subsystem mountpoints at boot-time)
Procedures
Procedure: prologue
- Ensure that the kernel command line includes:
swapaccount=1
(Without this, limiting memory usage will not work as described below.)
- See what cgroup subsystems are provided by your kernel:
cat /proc/cgroups
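/proc/cgroups is a four-column table: subsystem name, hierarchy ID, number of cgroups, and an enabled flag. A sketch that lists only the enabled subsystems; the sample content in the here-document is made up for illustration, so on a real system feed awk /proc/cgroups itself:

```shell
# List enabled cgroup subsystems from /proc/cgroups-format input
# (column 4 is the "enabled" flag; lines starting '#' are the header).
enabled=$(awk '!/^#/ && $4 == 1 { print $1 }' <<'EOF'
#subsys_name hierarchy num_cgroups enabled
cpu 3 1 1
cpuacct 3 1 1
memory 7 1 0
freezer 5 1 1
EOF
)
echo "$enabled"
```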
- Depending on what cgroup management tools are already installed on your system, you may or may not already see subsystem mountpoints under /sys/fs/cgroup.
Procedure: limiting resources of recently started processes
This procedure is not reboot-safe; a reboot-safe version may be added below later.
- Select a name for the group of processes you will be controlling (e.g. “ssh-procs”, “procs-run-as-user-fred”, “my-test-procs”).
- Select the resources (e.g. CPU, memory) whose usage by the group of processes you want to limit, and the corresponding cgroup subsystems (e.g. cpu, cpuacct, memory) that control them.
- Create a mountpoint for the set of subsystems and mount the set:
mkdir /sys/fs/cgroup/devices,memory,cpu,cpuacct
mount -t cgroup -o devices,memory,cpu,cpuacct cgroup /sys/fs/cgroup/devices,memory,cpu,cpuacct
- Create a subdirectory to allow specific limits to be applied to a specific set of processes (the cgroup!) for the resources:
mkdir /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs
- If you do not have a suitable process to manipulate, then compile this test program, which is referred to in the output below as a.out:
#include <stdlib.h>   /* for malloc(), free(), atof(), atoi() */
#include <stdio.h>    /* for printf(), setbuf() */
#include <errno.h>    /* for errno */
#include <error.h>    /* for error() */
#include <strings.h>  /* for rindex() */
#include <time.h>     /* for nanosleep() */

#define usage() error(1, 0, "Usage: %s <interval> <chunk-size> <progress-char>", argv[0])

char *progname;

int
main(int argc, char *argv[])
{
    int i, j, chunksize;
    float interval;
    char *progress_string;
    struct timespec interval_timespec;
    unsigned char *mp;

    /* process arguments */
    progname = rindex(argv[0], '/');
    progname = (progname == NULL) ? argv[0] : progname + 1;
    if (argc != 4)
        usage();
    errno = 0;
    interval = atof(argv[1]);
    interval_timespec.tv_sec = (time_t) interval;
    interval_timespec.tv_nsec = (interval - (int) interval) * 1000000000;
    if (errno)
        usage();
    chunksize = atoi(argv[2]);
    if (errno)
        usage();
    if (chunksize < 0)
        usage();
    progress_string = argv[3];

    /* don't buffer stdout */
    setbuf(stdout, NULL);

    /* loop */
    for (i = 0; ; i++) {
        /* optimisation: don't call malloc() if malloc()ing zero bytes */
        if (chunksize > 0 && (mp = malloc(chunksize)) == NULL)
            error(1, 0, "%s: internal error; malloc() failed", progname);
        /* touch every byte, as Linux's malloc() is opportunistic */
        for (j = 0; j < chunksize; j++)
            mp[j] = (unsigned char) (j % 256);
        printf("%s", progress_string);
        /* optimisation: don't call (relatively expensive) nanosleep() if sleeping for zero */
        if (interval > 0.0)
            nanosleep(&interval_timespec, NULL);
    }
}
- Run the test program, assign it to the cgroup and then apply a limit of 100MB memory:
# Every 1s get ~5MB memory, print '+' after each, bg it so pid in $!
./a.out 1 5000000 + &
echo $! > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cgroup.procs
# Set memory limit (order important)
echo 100M > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.limit_in_bytes
echo 100M > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.memsw.limit_in_bytes
# Wait for shell to report that a.out got killed (kill -9, i.e. untrappable)
...
# Clear memory limit (order important)
echo -1 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.memsw.limit_in_bytes
echo -1 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/memory.limit_in_bytes
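Note that the limit files accept K/M/G suffixes on write but report plain bytes when read back. A hypothetical helper for doing the conversion by hand (to_bytes is my own name; it assumes the kernel's powers-of-1024 interpretation of the suffixes):

```shell
# Convert a size with an optional K/M/G suffix to bytes
# (suffixes are powers of 1024, matching the kernel's interpretation).
to_bytes() {
    case "$1" in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

to_bytes 100M   # → 104857600, which is what memory.limit_in_bytes reads back
```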
Note that:
- you cannot set memory.memsw.limit_in_bytes without setting memory.limit_in_bytes first. If you try to then you get:
-bash: echo: write error: Invalid argument
- you cannot set a limit after a process has exceeded that limit. If you try then you get:
-bash: echo: write error: Device or resource busy
- an explanation of the tunable parameters can be found here
- Run the test program, assign it to the cgroup and then apply a limit of 20% CPU:
# Every 0s get 0 bytes memory, print nothing after each, bg it so pid in $!
./a.out 0 0 '' &
echo $! > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cgroup.procs
# Note run interval (every 0.1s) and allowed runtime per interval (unrestricted)
cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_period_us
100000
cat /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_quota_us
-1
# See it using 100% CPU
top -n 1 -b | fgrep a.out
 5275 alexis    20   0  4188  356  276 R 99.3  0.0   7:08.11 a.out
# Calculate 20% of run interval and set that as allowed runtime per interval
echo 20000 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_quota_us
# See it using less than 20% CPU
top -n 1 -b | fgrep a.out
 5275 alexis    20   0  4188  356  276 R 13.2  0.0   7:03.22 a.out
# Clear CPU limit
echo -1 > /sys/fs/cgroup/devices,memory,cpu,cpuacct/my-test-procs/cpu.cfs_quota_us
# See it using 100% CPU again
top -n 1 -b | fgrep a.out
 5275 alexis    20   0  4188  356  276 R 99.3  0.0   7:07.17 a.out
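The quota is just a percentage of the period; a small hypothetical helper (cpu_quota_for is my own name) for the arithmetic:

```shell
# Given a period in microseconds and a target CPU percentage,
# compute the value to write into cpu.cfs_quota_us.
cpu_quota_for() {
    period_us=$1
    percent=$2
    echo $(( period_us * percent / 100 ))
}

cpu_quota_for 100000 20   # → 20000, as used above
```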
Note that:
- an explanation of the tunable parameters can be found here
- in the above two steps (for memory and CPU control respectively), two subsystems were managed via one mountpoint; in practice this makes cgroup administration awkward, since co-mounted subsystems must share one hierarchy of cgroups; better to use one mountpoint per subsystem.
- Run the test program, assign it to the cgroup and then “freeze” (i.e. suspend) and “thaw” (i.e. allow to run) it:
# Every 1s get 0 bytes memory, print '+' after each, bg it so pid in $!
./a.out 1 0 '+' &
mkdir /sys/fs/cgroup/freezer/my-test-procs
echo $! > /sys/fs/cgroup/freezer/my-test-procs/cgroup.procs
# Note how process is currently thawed
cat /sys/fs/cgroup/freezer/my-test-procs/freezer.state
THAWED
# Freeze the process
echo FROZEN > /sys/fs/cgroup/freezer/my-test-procs/freezer.state
# Check process is frozen
...
# Thaw the process
echo THAWED > /sys/fs/cgroup/freezer/my-test-procs/freezer.state
# Check process has thawed
...
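Writing FROZEN only starts the freeze; freezer.state may read FREEZING for a moment before settling on FROZEN, so "check process is frozen" means polling. A sketch of such a poll (wait_frozen is my own name), demonstrated here against an ordinary file standing in for freezer.state so that it is self-contained:

```shell
# Poll a freezer.state-style file until it reads FROZEN, giving up
# after ~1 second.
wait_frozen() {
    state_file=$1
    for attempt in 1 2 3 4 5 6 7 8 9 10; do
        [ "$(cat "$state_file")" = "FROZEN" ] && return 0
        sleep 0.1
    done
    return 1
}

# Demonstration with a plain file standing in for freezer.state
echo FROZEN > /tmp/fake-freezer-state
wait_frozen /tmp/fake-freezer-state && echo frozen   # prints "frozen"
```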
Note that:
- the freezer subsystem was already in use by a mount (probably triggered by installing cgroups-bin), so I had to make use of the existing access path, rather than preparing the mount myself, hence the path being different to the previous steps
- the shell does not detect that the process has been frozen (freezing doesn’t use signals)
- there are no tunable parameters for this subsystem
Procedure: switching to one mountpoint per subsystem
- Debian/Ubuntu: install cgroup-lite and reboot.
Procedure: verifying that a process can only be a member of one cgroup per hierarchy
- Run
# We need a 'cat' command that does not create new processes; implement it as a shell function
cat() { while IFS= read -r LINE; do echo "$LINE"; done; }
# Start a process
sleep 1000 &
PID=$!
echo $PID
# Record the pids in the root cgroup
cat < /sys/fs/cgroup/freezer/cgroup.procs > before.pids
# Add the new process to a (non-root) cgroup
echo $PID > /sys/fs/cgroup/freezer/my-test-procs/cgroup.procs
# Re-record the pids in the root cgroup
cat < /sys/fs/cgroup/freezer/cgroup.procs > after.pids
# Look! The new pid was *moved* from one cgroup to another, not copied
diff before.pids after.pids
70d69
< 5849
- We can repeat this with a new cgroup inside the cgroup:
# Create a sub-cgroup
mkdir /sys/fs/cgroup/freezer/my-test-procs/my-sub-test-procs
# Put the process in it
echo $PID > /sys/fs/cgroup/freezer/my-test-procs/my-sub-test-procs/cgroup.procs
# The root cgroup already did not have the pid in it, so was unaffected
cat < /sys/fs/cgroup/freezer/cgroup.procs > after2.pids
diff after.pids after2.pids
# But look: it has been moved from my-test-procs to my-test-procs/my-sub-test-procs
cat < /sys/fs/cgroup/freezer/my-test-procs/cgroup.procs
cat < /sys/fs/cgroup/freezer/my-test-procs/my-sub-test-procs/cgroup.procs
5849
- Read this.
Procedure: enforcing resource limits
- Verify that the program cgrulesengd is installed; install if necessary.
- Edit /etc/cgrules.conf (it may not exist already) and add the following:
*:a.out cpu /my-test-group
- Run:
cgrulesengd -d
CGroup Rules Engine Daemon log started
Current time: Fri Jun 12 10:55:39 2015
Opened log file: -, log facility: 0, log level: 7
Proceeding with PID 1121
Rule: *:a.out
  UID: any
  GID: any
  DEST: /my-test-group
  CONTROLLERS: cpu
Started the CGroup Rules Engine Daemon.
- Now run a.out as follows:
./a.out 0 0 '' &
- Use top to monitor its CPU consumption, which should be around 20% 🙂
- Note:
- cgrulesengd does not start automatically; an init.d or systemd script is needed.
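As a sketch of the systemd route, something like the following could be installed; the binary path /usr/sbin/cgrulesengd and the -n (don't daemonize) flag are assumptions to check against your distribution's package, and this fragment needs root to run:

```shell
# Install a minimal systemd unit for cgrulesengd and start it.
# Path and -n flag assumed; verify with `cgrulesengd --help`.
cat > /etc/systemd/system/cgrulesengd.service <<'EOF'
[Unit]
Description=CGroup Rules Engine Daemon
After=local-fs.target

[Service]
ExecStart=/usr/sbin/cgrulesengd -n
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now cgrulesengd.service
```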
Questions
- Why does ‘mount’ output not show all cgroup mountpoints? Catting /proc/mounts does. E.g.:
test01# mount | grep cgroup/
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
cgroup on /sys/fs/cgroup/devices,memory,cpu,cpuacct type cgroup (rw,devices,memory,cpu,cpuacct)
test01# grep cgroup/ /proc/mounts
systemd /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,name=systemd 0 0
cgroup /sys/fs/cgroup/devices,memory,cpu,cpuacct cgroup rw,relatime,devices,memory,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,relatime,hugetlb 0 0
test01#