Configuring monitoring services using Check_MK (revision 3)

Introduction

This page describes how Alexis Huxley installed and configured a Checkmk server. It is based on the official documentation.

Installing the Checkmk server

  1. Download the Checkmk Raw Edition directly to the target server:
    wget https://download.checkmk.com/checkmk/2.0.0p15/check-mk-raw-2.0.0p15_0.bullseye_amd64.deb
  2. Add the Checkmk repo key:
    wget -qO - https://download.checkmk.com/checkmk/Checkmk-pubkey.gpg | apt-key add -
  3. Install that package and its dependencies:
    dpkg -i --force-depends check*.deb
    apt -y --fix-broken install
  4. Verify the installation as shown below:
    chifferi# omd version
    OMD - Open Monitoring Distribution Version 2.0.0p15.cre
    chifferi#
  5. Decide upon a monitoring site name (e.g. ‘default’).
  6. Create the site. E.g. with:
    omd create default
  7. Note the admin’s login and password as displayed in the output of the command in the previous step.
  8. If you installed Checkmk in a container:
    1. You may encounter this warning:
      Creating temporary filesystem /omd/sites/default/tmp...mount: /opt/omd/sites/default/tmp: must be superuser to use mount.
      WARNING: You may continue without tmpfs, but the performance of Checkmk may be degraded
    2. If not then skip the rest of this subsection.
    3. Shut down the container.
    4. Add a filesystem to the container with the following specification:
      • Type: Ram
      • Driver: Loop
      • Usage: 512MB
      • Target path: /opt/omd/sites/default/tmp
    5. Restart the container and verify that the path is a tmpfs filesystem with the correct amount of space available, as shown here:
      chifferi# df -h /opt/omd/sites/default/tmp
      Filesystem      Size  Used Avail Use% Mounted on
      tmpfs           512M     0  512M   0% /opt/omd/sites/default/tmp
      chifferi#
    6. Checkmk should have failed to start because of this:
      chifferi# ls -ld /opt/omd/sites/default /opt/omd/sites/default/tmp
      drwxrwxrwt 11 default default 260 Nov 22 14:30 /opt/omd/sites/default/tmp
      drwxrwxrwt 11 root    root    260 Nov 22 14:30 /opt/omd/sites/default/tmp
      chifferi#

      (I.e. the tmpfs mountpoint has not remembered that it should belong to default:default, which is because it is a tmpfs.)

    7. Edit /etc/systemd/system/omd.service and add the following lines:
      ...
      [Service]
      ...
      ExecStartPre=/bin/chown default:default /opt/omd/sites/default/tmp
      ExecStartPre=/bin/chmod 770 /opt/omd/sites/default/tmp
      ...
    8. Reboot the container.
    9. This time the container should successfully start at boot time.
  9. If you did not install Checkmk in a container and you did use PCMS to install the system:
    1. Adjust your PCMS site-specific configuration module so that the entry that has been added to /etc/fstab is made persistent.
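After the container steps above, it can be useful to confirm that the tmpfs mountpoint really has the owner and mode set by the two ExecStartPre lines. The following is a minimal sketch, not part of Checkmk or OMD: check_tmp_perms is a hypothetical helper name, and it assumes GNU stat as shipped with Debian.

```shell
#!/bin/sh
# Sketch: report whether a directory has the expected owner:group and mode.
# check_tmp_perms is a hypothetical helper, not an OMD or Checkmk command.
# Assumes GNU stat (coreutils).
check_tmp_perms() {
    dir=$1; want_owner=$2; want_mode=$3
    have_owner=$(stat -c '%U:%G' "$dir") || return 3
    have_mode=$(stat -c '%a' "$dir") || return 3
    if [ "$have_owner" = "$want_owner" ] && [ "$have_mode" = "$want_mode" ]; then
        echo "OK: $dir is $have_owner mode $have_mode"
        return 0
    fi
    echo "MISMATCH: $dir is $have_owner mode $have_mode, wanted $want_owner mode $want_mode"
    return 1
}

# Example for the site used on this page (run on the Checkmk server):
#   check_tmp_perms /opt/omd/sites/default/tmp default:default 770
```

The function returns 0 when both match, 1 on a mismatch and 3 if stat itself fails, so it can also be used in a script.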

Proxying connections to the monitoring site

  1. Note the interface and port number specified in /omd/sites/default/etc/apache/listen-port.conf (probably 127.0.0.1:5000).
  2. Change the interface to all interfaces by running:
    omd stop default
    omd config default set APACHE_TCP_ADDR 0.0.0.0
    omd start default
  3. On the front-end webserver set up proxying to the back end webserver as described in Configuring web services (revision 2).
  4. On the front-end webserver add https support as described in Configuring web services (revision 2).
  5. To make certain clients go to the old site and other clients go to the new site, which is useful during a config migration, replace the proxying code with something like:
    RewriteEngine On
    RewriteCond %{REMOTE_ADDR} ^192\.168\.1\.16
    RewriteRule ^/(.+) http://chifferi.pasta.net:5000/$1 [P]
    ProxyPassReverse / http://chifferi.pasta.net:5000/
    RewriteCond %{REMOTE_ADDR} !^192\.168\.1\.16
    RewriteRule ^/(.+) http://penne.pasta.net:5000/$1 [P]
    ProxyPassReverse / http://penne.pasta.net:5000/

    Basics: changing cmkadmin’s password

    1. Run:
      su - default
      htpasswd -m ~/etc/htpasswd cmkadmin
      exit

    Basics: adding users

    1. Log in to the web interface as cmkadmin.
    2. Go to Setup–>Users–>Add User.
    3. Follow the prompts.
    4. Save the user and apply the change.
    5. Log out as cmkadmin and log in as the new user.
    6. Tailor the side panel to suit your personal preferences.

    Some notes about host tags and host labels

    Similarities:

    • both are variable+value pairs
    • both can be associated with a host or a service or a user, etc.
    • both may be referenced by rules

    Differences:

    • labels are defined on-the-fly, i.e. at the moment they need to be assigned; tags need to be predefined
    • labels are selected by typing their names; tags are selected from pulldown menus
    • consequently, when referencing labels for the second time, they could be misspelled with the result that rules do not get applied; because tags are selected from a pulldown menu, they cannot be misspelled
    • it is expected that a particular tag is assigned to all hosts; this is not the case for labels.
    • tags can be grouped into topics (like folders); labels cannot.
    • https://en.wikipedia.org/wiki/Taxonomy

    Terminology:

    • what I called a tag is actually a “tag group” followed by a “tag” (e.g. “location” followed by “Munich” or “location” followed by “Paris”)
    • an “auxiliary tag” is a “tag” that may be associated with multiple “tags” within a tag group (e.g. “Europe” is associated with “Munich” and with “Paris”) but can be referenced as if it were a normal tag (e.g. “location” followed by “Europe”); the result is that a tag group + tag is automatically inherited for the auxiliary tag (e.g. a rule that depends on “location” + “Europe” will be activated for hosts that have the tag “location” + “Munich”).

    Formats:

    • labels have short ID codes of the format <variable>:<value> (e.g. “location:Munich”)
    • taggroups have short ID codes of the format <value> (e.g. “location”) and longer titles of the format <free-text> (e.g. “Location Name”)
    • tags have short ID codes of the format <value> (e.g. “munich”) and longer titles of the format <free-text> (e.g. “Our Munich Office”)
    • auxiliary tags have the same format as tags

    Setting up host tags and host labels (to be done before adding any hosts)

    1. Do not use labels; they are prone to misspelling and forgetting to add them to a host.
    2. On paper, design a taxonomy for tagging! This is much more complicated than it sounds! In case it helps, I created the following tag groups with the following tags and auxiliary tags:
      • Alexis/Entity Type:
        • pm-server (Physical machine with “server” profile)
        • pm-desktop (Physical machine with “desktop” profile)
        • pm-laptop (Physical machine with “laptop” profile)
        • qemu-server (QEMU virtual machine with “server” profile)
        • qemu-desktop (QEMU virtual machine with “desktop” profile)
        • lxc-server (LXC container with “server” profile)
        • pm (Physical machine) [auxiliary tag associated with all pm-* tags]
        • qemu (QEMU virtual machine) [auxiliary tag associated with all qemu-* tags]
        • lxc (LXC container) [auxiliary tag associated with all lxc-* tags]
        • server (has “server” profile) [auxiliary tag associated with all *-server tags]
        • desktop (has “desktop” profile) [auxiliary tag associated with all *-desktop tags]
        • laptop (has “laptop” profile) [auxiliary tag associated with all *-laptop tags]
        • computer (Computer) [auxiliary tag associated with all above tags]
        • netgear-switch (Netgear Switch)
        • fritzbox (Fritzbox)
        • website (Website)
        • ping-only-target (Ping-only target)
      • Alexis/Persistence:
        • persistent (Persistent)
        • transient (Transient)

      Note that Entity Type and Persistence are independent of each other.

    3. Go to Setup–>Tags–>Add tag group.
    4. Add all your tag groups and tags.
    5. Go to Setup–>Tags–>Add aux tag.
    6. Add all your auxiliary tags, associating them with the relevant tags.
    7. Apply the change.

    Service tags and labels (to be done before adding any hosts)

    Service tags and labels are functionally equivalent to host tags and labels, but are applied to services, rather than hosts.

    As noted above, each host tag group should be assigned to each host, albeit with the appropriate tag. I think this rule should also be valid for each service tag group. However, since there are hundreds of services, this is somewhat less practical. Therefore I use labels rather than tags for services.

    My specific use case is that I have a script checking my websites via external proxies. The script is identical for checking each website, but the parameters are different. Since I have not yet written WATO configuration scripts to allow these checks all to be for the same service on different hosts but with different parameters (see below, if I ever get round to it!) then these services are actually registered as different services. But they are all subject to some Checkmk-specifiable parameters (e.g. try several times before sending a notification, as the proxies tend not to be reliable).

    So these are service-specific and cannot be created before the service. See ‘Special: using custom server-side service checking scripts’ below.

    Intelligible NIC naming (to be done before adding any hosts)

    This is useful so that when custom graphs are added to the dashboard you know which interface to monitor.

    1. Go to Setup–>Services–>Discovery rules–>Network interface and switch port discovery.
    2. Create a new rule with:
      • Configure discovery of single interfaces: YES
      • Appearance of network interface: alias
      • Port numbers: do not pad
    3. Apply the change.

    Basics: installing the Checkmk agent on all hosts and adding the hosts to Checkmk

    1. Install the Checkmk-provided agent on the Checkmk server (this is just to install the more modern client on a modern OS and have it talk with a modern Checkmk)
    2. Install the Checkmk-provided agent on all hosts; Alexis should instead:
      1. Run something like this as repomaster@lagane:
        paa -v insert localprivate-deb buster,bullseye main ~alexis/check-mk-agent_2.0.0p15-1_all.deb
        paa -v control localprivate-deb
      2. Run pcms on all hosts.
    3. Verify that all hosts now have installed the correct version of the agent.
    4. Go to Setup–>Hosts–>Hosts–>Import hosts via CSV file.
    5. Paste in the hosts, disable ‘Has title line’, from the pull-down menu select Hostname, and enable ‘Perform automatic service discovery’.
    6. Click Upload.
    7. Click ‘Update preview’ and if it looks good then click ‘Import’.
    8. Go to Setup–>Hosts and, for each host, pay special attention that
      1. all tag groups have the correct tag selected (don’t select auxiliary tags; those are referenced in rules, not directly by hosts)
      2. the Network Segment is set to ‘WAN (high latency)’ where appropriate
    9. Click ‘Save & go to service configuration’.
    10. Click ‘Fix all’ (possibly more than once).
    11. Apply the change by clicking the yellow ‘!’ icon (top right of main panel) and then pressing ‘Activate on selected sites’.
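For step 3 above (verifying the agent version on all hosts), the agent's own output can be parsed, since it begins with a <<<check_mk>>> section containing a "Version:" line. The following is a minimal sketch: agent_version is a hypothetical helper name, and the hosts.txt file and root ssh access in the usage comment are assumptions.

```shell
#!/bin/sh
# agent_version is a hypothetical helper: it reads agent output on stdin and
# prints the value of the "Version:" line inside the <<<check_mk>>> section.
agent_version() {
    awk '
        /^<<<check_mk>>>/ { in_section = 1; next }   # start of the section we want
        /^<<</            { in_section = 0 }         # any other section header ends it
        in_section && $1 == "Version:" { print $2; exit }
    '
}

# Example (assumes root ssh access and a hypothetical hosts.txt, one host per line):
#   while read -r h; do
#       printf '%s %s\n' "$h" "$(ssh -n "root@$h" check_mk_agent | agent_version)"
#   done < hosts.txt
```

Hosts that print an empty version, or one other than the version you installed, are the ones to investigate.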

      Special: performing checks over ssh

      This was taken from a forum discussion.

      1. On the server run:
        omd sites
        omd su <site-name>
        ssh-keygen
        cat .ssh/id_rsa.pub
      2. On the client, add a line like the following to ~root/.ssh/authorized_keys:
        command="/usr/bin/check_mk_agent" ssh-rsa ...
        
      3. On the server add a suitable entry to ~/.ssh/config and test with:
        ssh root@<client>
      4. If the client is on another network segment (which it probably is if you’re checking it via ssh) then go to Setup–>Hosts–><hostname>–>Custom Attributes–>Networking Segment: WAN
      5. Go to Setup–>Hosts–><hostname>–>Hosts (menu)–>Effective Parameters–>Other Integrations–>Individual program call instead of agent access–>
      6. Create rule in the main directory with:
        • Commandline to execute: ssh -T -oStrictHostKeyChecking=no root@$HOSTADDRESS$
        • Explicit hosts: cercis
      7. Apply the change.
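The authorized_keys line in step 2 above is simply the site user's public key prefixed with a forced command. The following is a minimal sketch of building that line: forced_command_line is a hypothetical helper name, and the agent path defaults to the one used on this page.

```shell
#!/bin/sh
# forced_command_line is a hypothetical helper: it reads public keys on stdin
# (one per line) and prints each prefixed with a forced command, so that the
# key can only be used to run the agent.
forced_command_line() {
    agent=${1:-/usr/bin/check_mk_agent}
    while IFS= read -r key || [ -n "$key" ]; do
        [ -n "$key" ] && printf 'command="%s" %s\n' "$agent" "$key"
    done
    return 0
}

# Example (as the site user on the server):
#   forced_command_line < ~/.ssh/id_rsa.pub
# then append the printed line to ~root/.ssh/authorized_keys on the client.
```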

      Some notes about custom checks and plugins

      This section represents my understanding and may contain errors.

      A check:

      • discovers and monitors (i.e. reports if status is good or bad) one (e.g. memory consumption) or more services (e.g. how full is the root filesystem, how full is the /usr filesystem, how full is the /tmp filesystem, etc)
      • is typically just one file written in any language
      • must produce one line per detected service with each line having the format:
        <status> "<service-name>" {<metrics>|-} <status-detail>

        (Note that the output mentions service names, which, even though they are all detected by the same check, are all different.)

      • Checks need to be installed on all clients (and remember that the server itself is also a client) in /usr/lib/check_mk_agent/local
      • Installation of the checks on all clients can be done by the bakery facility (Checkmk Enterprise only) or by any other method (e.g. scp, ansible, etc).
      • The Checkmk agent (/usr/bin/check_mk_agent) will automatically reformat the output to match the format produced by plugins.
      • Checks that other people have written are available on The Monitoring Plugins Project
      • More details about checks can be found here.
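To illustrate the one-line-per-service format described above, here is a minimal sketch of a check that reports how full each filesystem is. The helper names (local_check_line, usage_status), the service names, the metric name used_pct and the 80/90 thresholds are all illustrative assumptions, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# local_check_line is a hypothetical helper that prints one service line:
#   <status> "<service-name>" <metrics-or-dash> <status-detail>
local_check_line() {
    status=$1; name=$2; metrics=${3:--}; detail=$4
    printf '%s "%s" %s %s\n' "$status" "$name" "$metrics" "$detail"
}

# usage_status maps a percentage to 0 (OK), 1 (WARN) or 2 (CRIT).
usage_status() {
    pct=$1; warn=$2; crit=$3
    if [ "$pct" -ge "$crit" ]; then echo 2
    elif [ "$pct" -ge "$warn" ]; then echo 1
    else echo 0
    fi
}

# One service per filesystem backed by a /dev/ device (thresholds are illustrative).
df -P | awk 'NR > 1 && $1 ~ /^\/dev\// { sub(/%/, "", $5); print $5, $6 }' |
while read -r pct mount; do
    local_check_line "$(usage_status "$pct" 80 90)" "Filesystem $mount" \
        "used_pct=$pct" "$pct% used on $mount"
done
```

Each output line names a different service, even though all of them come from the same script.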

      A plugin:

      • discovers and displays raw data (i.e. it does not report if status is good or bad) about one or more services
      • is typically several files: a data collector (written in any language), a server-side script to register the check and parse the data (written in Python) and optionally a metrics (i.e. graphs) config file, a perf-o-meter config file, a WATO config file (all written in Python)
      • is at least one file (the data collector) but may include more (service rules for use from Checkmk’s web interface, maybe a Perf-o-Meter)
      • the data collector must produce output of the format:
        <<<<check-name>>>>
        <services-data-spanning-multiple-lines-in-any-format>

        (That first line is <check-name> enclosed in three angle brackets.)

      • Data collectors need to be installed on all clients (and remember that the server itself is also a client) in /usr/lib/check_mk_agent/plugins; the other components need to be installed under /omd/sites/<sitename>
      • Installation of the data collectors on all clients can be done by the bakery facility (Checkmk Enterprise only) or by any other method (e.g. scp, ansible, etc)
      • Plugins that other people have written are available on Checkmk Exchange in .mkp format
      • the mkp command can unpack .mkp files and install the server-side components in the right place but the client-side data collector still needs to be installed on all the clients in /usr/lib/check_mk_agent/plugins
      • More details about plugins can be found here.
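The data collector output format described above (a header line followed by free-form data) can be sketched as follows. This is only an illustration: emit_section is a hypothetical helper name, hello_sketch is a made-up section name, and the data line is an arbitrary example, since the real format is whatever your server-side parse function expects.

```shell
#!/bin/sh
# emit_section is a hypothetical helper: it prints the section header for the
# given check name, followed by whatever raw data arrives on stdin.
emit_section() {
    printf '<<<%s>>>\n' "$1"
    cat
}

# Example collector body: one line of raw data in a format of our own choosing.
# "hello_sketch" is an illustrative section name, not an existing plugin.
printf 'collected_at %s\n' "$(date +%s)" | emit_section hello_sketch
```

Note that the collector reports raw data only; deciding whether the data is good or bad is left to the server-side Python code.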

      As an exercise:

      1. install the ‘hello world’ plugin from the Checkmk Exchange as documented below
      2. wait until you have seen alerts for all hosts
      3. to disable the check while leaving it installed:
        1. Go to Setup–>Services–>Service discovery Rules–>Disabled services.
        2. Create a new rule in the main directory with:
          • Services: Hello World
        3. Apply the change.

      Special: using custom client-side service checks

      1. Write the script; it should return 0 on success, 1 for warning, 2 for critical and 3 for unknown (see the Nagios Plugin API docs for more details).
      2. Install the script in /usr/lib/check_mk_agent/local on all clients; Alexis should instead complete ‘Special: installing my monitoring module’ below.

      Special: using custom server-side service checks

      1. Write the script; it should return 0 on success, 1 for warning, 2 for critical and 3 for unknown (see the Nagios Plugin API docs for more details).
      2. Install the script in /omd/sites/<sitename>/local/lib/nagios/plugins; Alexis should instead complete ‘Special: installing my monitoring module’ below.
      3. Go to Setup–>Services–>Other services–>Integrate Nagios plugins.
      4. Create a new rule in the main directory with:
        • Description (e.g. “Check access to www.pasta.freemyip.com”)
        • Service description (e.g. “website-wordpress-alexis”)
        • Command line (e.g. “check-with-multiple-proxies -i $SERVICEDESC$ -u https://$HOSTNAME$/ -s Gardening”)
        • Explicit hosts (e.g. “www.pasta.freemyip.com”)
      5. Apply the change.
      6. To allow several services to be grouped together for the purpose of applying a rule to the group (e.g. allow these services to fail a few more times than normal before sending a notification) do the following:
        1. Note the service descriptions
        2. Go to: Setup–>Services–>Service monitoring rules–>Service labels (see above for why we do not use service tags).
        3. Create a new rule in the main directory with:
          • Label: <some-label> (e.g. “unreliable-check:yes”)
          • Services: <list-of-services> (e.g. “website-wordpress-alexis, website-wordpress-judith, website-wordpress-suzie, website-openproject, website-debian, website-sources, website-redhat, website-svn-main”)
        4. Save the change but don’t bother applying it just yet.
        5. See the section ‘Special: allow brief service failure’ below.
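Step 1 above relies on the Nagios exit-code convention. The following is a minimal sketch of that convention, not the actual script used on this page: nagios_result and check_contains are hypothetical helper names, and the curl invocation in the usage comment is an assumption.

```shell
#!/bin/sh
# nagios_result is a hypothetical helper: it prints a one-line status message
# and returns the matching exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
nagios_result() {
    case $1 in
        OK)       code=0 ;;
        WARNING)  code=1 ;;
        CRITICAL) code=2 ;;
        *)        set -- UNKNOWN "${2:-unrecognised state}"; code=3 ;;
    esac
    printf '%s - %s\n' "$1" "$2"
    return "$code"
}

# check_contains is an illustrative check: CRITICAL unless the text on stdin
# contains the given search string.
check_contains() {
    if grep -q -F -- "$1"; then
        nagios_result OK "found '$1'"
    else
        nagios_result CRITICAL "'$1' not found"
    fi
}

# Example use as the last lines of a plugin script (curl usage is an assumption):
#   curl -s "https://$1/" | check_contains "$2"
#   exit $?
```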

      Special: installing plugins from Checkmk Exchange

      1. Visit the Checkmk Exchange, search or browse your way to the desired plugin and copy the download URL.
      2. On the Checkmk server run:
        omd su <sitename>
        wget -q <url>
        mkp <downloaded-file>
        mkp list <plugin-name> | grep agents/

        That last command shows the file that needs to be installed on all the clients.

      3. Install that file on all the clients; how you do this depends on your environment: if you have Checkmk Enterprise Edition then the bakery can do it; if you have Checkmk Raw Edition then you need to use scp or Ansible or some other distribution method. Remember: you want to install it into /usr/lib/check_mk_agent/plugins on all clients and remember the server is also a client; Alexis should instead complete ‘Special: installing my monitoring module’ below.
      4. Go to Setup–>Hosts–>Hosts (menu)–>Discover services.
      5. On the Mode menu select ‘Refresh all services (tabula rasa)’, click Start and wait for the scan to complete.
      6. Apply the changes.

      Special: writing custom client-side service plugins

      This is way too complicated to describe here fully.

      1. Make sure you have installed the hello_world plugin (see above).
      2. Note the important components of the hello_world plugin on the Checkmk server:
        chifferi# omd su default
        OMD[default]:~$ mkp list hello_world | sed "s@$(pwd)/@@" | egrep -v '(checkman)' | cat -n
        1  local/lib/check_mk/base/plugins/agent_based/hello_world.py
        2  local/share/check_mk/agents/plugins/hello_world
        3  local/share/check_mk/web/plugins/metrics/helloworld_metric.py
        4  local/share/check_mk/web/plugins/perfometer/helloworld_perfometer.py
        5  local/share/check_mk/web/plugins/wato/helloworld_parameters.py
        OMD[default]:~$

        The five files are:

        1. define discover function, define monitor function (which compares raw data with thresholds and derives service status), register both functions with Checkmk server
        2. data collector program (to be copied to all clients, including the server)
        3. define graphs
        4. define perf-o-meters
        5. define check’s default thresholds, what is modifiable in WATO
      3. Write the data collector program first; the output format is your own choice because – later – you will write a Python function to parse it (I wrote ‘dummy’).

      Special: installing my monitoring module

      I store all my monitoring scripts in an SVN module, so I just need to checkout the module and replace a few directories with symlinks as follows.

      1. If adding a plugin then add it in /usr/local/opt/monitoringtools/agent-plugins/bin and commit.
      2. As root on lagane run:
        rocon -c 'screen -d -m pcms --no-make --no-update-os --no-report' AllHosts

        (pcms will call /etc/pcms/site-config/plugins/install-pasta-monitoringtools, which will update /usr/local/opt/monitoringtools and ensure that /usr/lib/check_mk_agent/local, /usr/lib/check_mk_agent/plugins and /omd/sites/default/local/lib/nagios/plugins are all symlinks pointing into /usr/local/opt/monitoringtools.)

      Special: Increasing the service check timeout

      Per-service timeouts are provided by the Checkmk Micro Core, but the Checkmk Micro Core is only available in the Checkmk Enterprise editions. However, a global service timeout is configurable.

      1. Run:
        omd su <site-name>
        vi etc/nagios/nagios.cfg
      2. Add the line:
        service_check_timeout=120
        
      3. Stop and start the site and exit the shell:
        omd stop
        omd start
        exit
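If you prefer not to edit the file by hand, the change in step 2 above can be scripted so that running it twice does not add a second line. The following is a minimal sketch: set_nagios_option is a hypothetical helper name, and it assumes GNU sed (for -i) and simple option names without regular-expression metacharacters.

```shell
#!/bin/sh
# set_nagios_option is a hypothetical helper: it sets key=value in a config
# file, replacing an existing "key=" line or appending one if none exists.
# Assumes GNU sed (-i) and keys without regex metacharacters.
set_nagios_option() {
    file=$1; key=$2; value=$3
    if grep -q "^${key}=" "$file"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example (as the site user, from the site's home directory):
#   set_nagios_option etc/nagios/nagios.cfg service_check_timeout 120
# The site still needs to be stopped and started afterwards.
```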

      Special: Handling transient hosts

      The official documentation aims to simulate that transient hosts and their services are all always up. I prefer that transient hosts are marked down and their services are not reported while in that state. This is already the default, except that cercis is pingable even when down (as its IP gets used by somebody else). So I need to choose a different way to monitor whether it is up or down.

      1. Go to Setup–>Hosts–>Host monitoring rules–>Host Check Command.
      2. Create a new rule in the main directory with:
        1. Host Check Command: TCP Connect/<secret>
        2. Explicit hosts: cercis
      3. Apply the changes.

        Special: monitoring websites

        1. Add the host but with:
          1. Checkmk agent: No API integrations, no Checkmk agent
          2. Networking segment: WAN
          3. Host tags: Entity type: Website (do the other taggroups too, but this one is particularly relevant)
        2. Go to Setup–>Services–>Other services–>Integrate Nagios plugins
        3. Create a new rule in the main directory with:
          1. Command:
            check-http-via-multiple-proxies -i $SERVICEDESC$ -u https://$HOSTNAME$/ -s Gardening

            (Adjust the search string to suit the particular website being monitored.)

          2. Explicit hostnames: <name-of-host>
        4. Note that when I replaced my front-end webserver, it became apparent that Checkmk was also checking the host. (It became apparent because Checkmk caches host IPs and, due to the new front-end webserver having a different IP address, …). The host itself should not be monitored. The best way to address this is to take the host status from a service status.
        5. Go to Setup–>Hosts–>Host monitoring rules–>Host check command.
        6. Create a new rule in the main directory with:
          1. Host Check Command: Always assume host to be up
          2. Host tags: Entity type: Website
        7. Apply the changes.

        Special: How to tell Checkmk that the current value for a service is correct when it thinks it’s wrong

        The specific case I needed this for was to totally disable the shared-memory-to-total-memory-ratio check on containers; see here for more info.

        1. Go to Setup–>Services–>Service monitoring rules–>Memory and Swap usage on Linux.
        2. Create a new rule in the main directory with:
          1. Upper levels for Shared memory: Do not impose levels
          2. Host tags: is LXC container
        3. Apply the changes.

        Special: Ping-only targets

        Special: FritzBoxes

        1. To allow Checkmk to collect statistics, ensure UPnP is enabled on the FritzBox (Home Network–>Network–>Network Settings–>Transmit status information over UPnP: YES).
        2. Go to Setup–>Agents–>Other integrations–>Fritz!Box Devices.
        3. Create a new rule in the main directory with:
          • Host tags: Entity Type is FritzBox (this relies on you having set up such a tag group and tag, as described above).
        4. Apply the changes.

        Special: allow more threads on desktops and laptops

        1. Go to Setup–>Services–>Enforced Services–>Number of threads.
        2. Create a new rule in the main directory with:
          • Check type: cpu_threads
          • Warning at: 4000 (was 2000)
          • Critical at: 5000 (was 4000)
          • Host tags: has “laptop” profile
        3. Clone the rule and change:
          • Host tags: has “desktop” profile

          (A single rule is applied only if all of its tags match, not if any one of them matches; so we need to use multiple rules.)

        4. Apply the changes.

        Special: calling a custom check multiple times for different services all on one host

        Special: allow brief host failure

        This is relevant for host cercis.

        1. Go to Setup–>Hosts–>Host monitoring Rules–>Maximum number of check attempts for host.
        2. Create a new rule in the main directory with:
          • Maximum number of check attempts for host: 3
          • Explicit host: cercis

        Special: allow brief service failure

        This is relevant for my services using the check-http-via-multiple-proxies check.

        1. Go to Setup–>Services–>Service monitoring rules–>Maximum number of check attempts for service.
        2. Create a new rule in the main directory with:
          • Maximum number of check attempts for service: 3
          • Service labels: has unreliable-check:yes
        3. Apply the changes.

        See also