Introduction
This page describes how Alexis Huxley installed and configured a Checkmk server. It is based on the official documentation.
Installing the Checkmk server
- Download the Checkmk Raw Edition directly to the target server:
wget https://download.checkmk.com/checkmk/2.0.0p15/check-mk-raw-2.0.0p15_0.bullseye_amd64.deb
- Add the Checkmk repo key:
wget -qO - https://download.checkmk.com/checkmk/Checkmk-pubkey.gpg | apt-key add -
- Install that package and its dependencies:
dpkg -i --force-depends check*.deb
apt -y --fix-broken install
- Verify the installation as shown below:
chifferi# omd version
OMD - Open Monitoring Distribution Version 2.0.0p15.cre
chifferi#
- Decide upon a monitoring site name (e.g. ‘default’).
- Create the site. E.g. with:
omd create default
- Note the admin’s login and password as displayed in the output of the command in the previous step.
- If you installed Checkmk in a container:
- You may encounter this warning:
Creating temporary filesystem /omd/sites/default/tmp...mount: /opt/omd/sites/default/tmp: must be superuser to use mount.
WARNING: You may continue without tmpfs, but the performance of Checkmk may be degraded
- If not then skip the rest of this subsection.
- Shut down the container.
- Add a filesystem to the container with the following specification:
- Type: Ram
- Driver: Loop
- Usage: 512MB
- Target path: /opt/omd/sites/default/tmp
- Restart the container and verify that that path is a tmpfs filesystem with the correct amount of space available, as shown here:
chifferi# df -h /opt/omd/sites/default/tmp
Filesystem Size Used Avail Use% Mounted on
tmpfs 512M 0 512M 0% /opt/omd/sites/default/tmp
chifferi#
- Checkmk should have failed to start because of this:
chifferi# ls -ld /opt/omd/sites/default /opt/omd/sites/default/tmp
drwxrwxrwt 11 default default 260 Nov 22 14:30 /opt/omd/sites/default
drwxrwxrwt 11 root root 260 Nov 22 14:30 /opt/omd/sites/default/tmp
chifferi#
(I.e. the tmpfs mountpoint has not remembered that it should belong to default:default, which is because it is a tmpfs.)
- Edit /etc/systemd/system/omd.service and add the following lines to the [Service] section:
...
[Service]
...
ExecStartPre=/bin/chown default:default /opt/omd/sites/default/tmp
ExecStartPre=/bin/chmod 770 /opt/omd/sites/default/tmp
...
- Reboot the container.
- This time the container should successfully start at boot time.
- If you did not install Checkmk in a container and you did use PCMS to install the system:
- Adjust your PCMS site-specific configuration module so that the entry that has been added to /etc/fstab is made persistent.
- The performance of Checkmk can be increased by moving /opt/omd/sites/default/var onto a RAM disk too, but this directory contains data that should persist across reboots, so the RAM disk needs to be restored from disk when Checkmk starts and saved back to disk when it stops. Do this as follows:
- Run:
systemctl stop omd
cd /opt/omd/sites/default
mv var var-disk
mkdir var
- Edit the service script /etc/systemd/system/omd.service and, after any other ExecStartPre lines, add the following:
ExecStartPre=/usr/bin/rsync -a --delete /opt/omd/sites/default/var-disk/ /opt/omd/sites/default/var/
ExecStopPost=/usr/bin/rsync -a --delete /opt/omd/sites/default/var/ /opt/omd/sites/default/var-disk/
- Shut down the VM.
- Add a 4096MB RAM disk (virt-manager–>Add Hardware–>Filesystem) with target path /opt/omd/sites/default/var.
- Restart the VM and log in as root.
- Verify the ownership of /opt/omd/sites/default/var, which will indicate that the rsync populated the initially empty, root-owned RAM disk, as in this example:
chifferi# ls -ld /opt/omd/sites/default/var
drwxr-xr-x 15 default default 300 Sep 10 16:24 /opt/omd/sites/default/var
chifferi#
- Go to the Checkmk web interface and verify that old data is shown in the graphs.
- Make a backup of /opt/omd/sites/default/var, but do not store it in the VM as it’s probably a bit short of space for that:
- Shut down Checkmk with:
systemctl stop omd
- On the VM server where the VM is running, cd to where the VM’s root disk is mounted, then cd further and make a backup as follows:
cd opt/omd/sites/default/var-disk
tar czf /scratch/chifferi-opt-omd-sites-default-var-disk-$(date +%Y%m%d%H%M%S).tar.gz .
- Back on the VM, restart Checkmk with:
systemctl start omd
- As an extra safety measure, set up a cronjob to be run as root to copy the memory cache back to disk, e.g.:
42 */4 * * * rsync -a --delete /opt/omd/sites/default/var/ /opt/omd/sites/default/var-disk/
- Finally, remember to make sure that the updated VM definition is on both VM servers.
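The periodic copy above uses rsync with --delete, so if it ever ran while the RAM disk was empty (for example because the restore at start-up had failed) it would empty the on-disk copy too. A minimal sketch of a guard for that case follows. The wrapper functions are hypothetical and not part of the setup described here, and "populated" is assumed to mean simply "contains at least one entry":

```shell
#!/bin/sh
# Hypothetical wrapper around the rsync call used by the cron job.
# Assumption: "populated" means "the directory contains at least one entry".

# Return 0 if directory $1 exists and contains at least one entry, else 1.
ramdisk_is_populated() {
    [ -d "$1" ] || return 1
    [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Copy the RAM disk ($1) back to its on-disk copy ($2), but only if the
# RAM disk is populated; otherwise refuse, so the disk copy is not wiped.
save_ramdisk() {
    if ramdisk_is_populated "$1"; then
        rsync -a --delete "$1/" "$2/"
    else
        echo "refusing to sync: $1 is missing or empty" >&2
        return 1
    fi
}

# Example (paths as used above):
#   save_ramdisk /opt/omd/sites/default/var /opt/omd/sites/default/var-disk
```

The cron entry could then call a script containing these functions instead of calling rsync directly.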
Proxying connections to the monitoring site
- Note the interface and port number specified in /omd/sites/default/etc/apache/listen-port.conf (probably 127.0.0.1:5000).
- Change the interface to all interfaces by running:
omd stop default
omd config default set APACHE_TCP_ADDR 0.0.0.0
omd start default
- On the front-end webserver set up proxying to the back end webserver as described in Configuring web services (revision 2).
- On the front-end webserver add https support as described in Configuring web services (revision 2).
- To make certain clients go to the old site and other clients go to the new site, which is useful during a config migration, replace the proxying code with something like:
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^192\.168\.1\.16
RewriteRule ^/(.+) http://chifferi.pasta.net:5000/$1 [P]
ProxyPassReverse / http://chifferi.pasta.net:5000/
RewriteCond %{REMOTE_ADDR} !^192\.168\.1\.16
RewriteRule ^/(.+) http://penne.pasta.net:5000/$1 [P]
ProxyPassReverse / http://penne.pasta.net:5000/
Basics: changing cmkadmin’s password
- Run:
su - default
htpasswd -m ~/etc/htpasswd cmkadmin
exit
Basics: adding users
- Log in to the web interface as cmkadmin.
- Go to Setup–>Users–>Add User.
- Follow the prompts.
- Save the user and apply the change.
- Log out as cmkadmin and log in as the new user.
- Tailor the side panel to suit your personal preferences.
Some notes about host tags and host labels
Similarities:
- both are variable+value pairs
- both can be associated with a host or a service or a user, etc.
- both may be referenced by rules
Differences:
- labels are defined on-the-fly, i.e. at the moment they need to be assigned; tags need to be predefined
- labels are selected by typing their names; tags are selected from pulldown menus
- consequently, when referencing labels for the second time, they could be misspelled with the result that rules do not get applied; because tags are selected from a pulldown menu, they cannot be misspelled
- it is expected that a particular tag is assigned to all hosts; this is not the case for labels.
- tags can be grouped into topics (like folders); labels cannot.
- https://en.wikipedia.org/wiki/Taxonomy
Terminology:
- what I called a tag is actually a “tag group” followed by a “tag” (e.g. “location” followed by “Munich” or “location” followed by “Paris”)
- an “auxiliary tag” is a “tag” that may be associated with multiple “tags” within a tag group (e.g. “Europe” is associated with “Munich” and with “Paris”) but can be referenced as if it were a normal tag (e.g. “location” followed by “Europe”); the result is that hosts with a given tag group + tag automatically inherit the auxiliary tag (e.g. a rule that depends on “location” + “Europe” will be activated for hosts that have the tag “location” + “Munich”).
Formats:
- labels have short ID codes of the format <variable>:<value> (e.g. “location:Munich”)
- tag groups have short ID codes of the format <value> (e.g. “location”) and longer titles of the format <free-text> (e.g. “Location Name”)
- tags have short ID codes of the format <value> (e.g. “munich”) and longer titles of the format <free-text> (e.g. “Our Munich Office”)
- auxiliary tags have the same format as tags
Setting up host tags and host labels (to be done before adding any hosts)
- Do not use labels; they are prone to being misspelled and it is easy to forget to add them to a host.
- On paper, design a taxonomy for tagging! This is much more complicated than it sounds! In case it helps I created the following tag groups with the following tags and auxiliary tags:
- Alexis/Entity Type:
- pm-server (Physical machine with “server” profile)
- pm-desktop (Physical machine with “desktop” profile)
- pm-laptop (Physical machine with “laptop” profile)
- qemu-server (QEMU virtual machine with “server” profile)
- qemu-desktop (QEMU virtual machine with “desktop” profile)
- lxc-server (LXC container with “server” profile)
- pm (Physical machine) [auxiliary tag associated with all pm-* tags]
- qemu (QEMU virtual machine) [auxiliary tag associated with all qemu-* tags]
- lxc (LXC container) [auxiliary tag associated with all lxc-* tags]
- server (has “server” profile) [auxiliary tag associated with all *-server tags]
- desktop (has “desktop” profile) [auxiliary tag associated with all *-desktop tags]
- laptop (has “laptop” profile) [auxiliary tag associated with all *-laptop tags]
- computer (Computer) [auxiliary tag associated with all above tags]
- netgear-switch (Netgear Switch)
- fritzbox (Fritzbox)
- website (Website)
- ping-only-target (Ping-only target)
- Alexis/Persistence:
- persistent (Persistent)
- transient (Transient)
- Alexis/Runs-LXC-Containers:
- true (does run LXC containers)
- false (does not run LXC containers)
Note that Entity Type and Persistence are independent of each other.
- Go to Setup–>Tags–>Add tag group.
- Add all your tag groups and tags.
- Go to Setup–>Tags–>Add aux tag.
- Add all your auxiliary tags, associating them with the relevant tags.
- Apply the change.
Service tags and labels (to be done before adding any hosts)
Service tags and labels are functionally equivalent to host tags and labels, but are applied to services, rather than hosts.
As noted above, each host tag group should be assigned to each host, albeit with the appropriate tag. I think this rule should also be valid for each service tag group. However, since there are hundreds of services, this is somewhat less appropriate. Therefore I use labels to tag services rather than tags.
My specific use case is that I have a script checking my websites via external proxies. The script is identical for checking each website, but the parameters are different. Since I have not yet written WATO configuration scripts to allow these checks all to be for the same service on different hosts but with different parameters (see below, if I ever get round to it!) then these services are actually registered as different services. But they are all subject to some Checkmk-specifiable parameters (e.g. try several times before sending a notification, as the proxies tend not to be reliable).
So these are service-specific and cannot be created before the service. See ‘Special: using custom server-side service checks’ below.
Intelligible NIC naming (to be done before adding any hosts)
This is useful so that when custom graphs are added to the dashboard you know which interface to monitor.
- Go to Setup–>Services–>Discovery rules–>Network interface and switch port discovery.
- Create a new rule with:
- Configure discovery of single interfaces: YES
- Appearance of network interface: alias
- Port numbers: do not pad
- Apply the change.
Basics: installing the Checkmk agent on all hosts and adding the hosts to Checkmk
- Install the Checkmk-provided agent on the Checkmk server (this is just to install the more modern client on a modern OS and have it talk with a modern Checkmk)
- Install the Checkmk-provided agent on all hosts; Alexis should instead:
- Run something like this as repomaster@lagane:
paa -v insert localprivate-deb buster,bullseye main ~alexis/check-mk-agent_2.0.0p15-1_all.deb
paa -v control localprivate-deb
- Run pcms on all hosts.
- Verify that all hosts now have installed the correct version of the agent.
- Go to Setup–>Hosts–>Hosts–>Import hosts via CSV file.
- Paste in the hosts, disable ‘Has title line’, from the pull-down menu select Hostname, enable ‘Perform automatic service discovery’.
- Click Upload.
- Click ‘Update preview’ and, if it looks good, click ‘Import’.
- Go to Setup–>Hosts and, for each host, pay special attention that
- all tag groups have the correct tag selected (don’t select auxiliary tags; those are referenced in rules, not directly by hosts)
- the Network Segment is set to ‘WAN (high latency)’ where appropriate
- Click ‘Save & go to service configuration’.
- Click ‘Fix all’ (possibly more than once).
- Apply the change by clicking the yellow ‘!’ icon (top right of main panel) and then pressing ‘Activate on selected sites’.
Special: performing checks over ssh
This was taken from a forum discussion.
- On the server run:
omd sites
omd su <site-name>
ssh-keygen
cat .ssh/id_rsa.pub
- On the client, add a line like the following to ~root/.ssh/authorized_keys:
command="/usr/bin/check_mk_agent" ssh-rsa ...
- On the server add a suitable entry to ~/.ssh/config and test with:
ssh root@<client>
- If the client is on another network segment (which it probably is if you’re checking it via ssh) then go to Setup–>Hosts–><hostname>–>Custom Attributes–>Networking Segment: WAN
- Go to Setup–>Hosts–><hostname>–>Hosts (menu)–>Effective Parameters–>Other Integrations–>Individual program call instead of agent access.
- Create rule in the main directory with:
- Commandline to execute: ssh -T -oStrictHostKeyChecking=no root@$HOSTADDRESS$
- Explicit hosts: cercis
- Apply the change.
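The authorized_keys entry used above can be tightened a little. The following is a minimal sketch, assuming you want to add OpenSSH's standard restriction options next to the forced command. The helper name is hypothetical and these extra options are a suggestion, not something the original setup used:

```shell
#!/bin/sh
# Hypothetical helper: print an authorized_keys line that forces the Checkmk
# agent to run and disables forwarding and terminal allocation.
# $1 is the public key text (e.g. the contents of .ssh/id_rsa.pub).
make_agent_key_line() {
    printf 'command="/usr/bin/check_mk_agent",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty %s\n' "$1"
}

# Example, run on the server as the site user; append the output to
# ~root/.ssh/authorized_keys on the client:
#   make_agent_key_line "$(cat ~/.ssh/id_rsa.pub)"
```

With such an entry in place, any ssh login with that key runs only the agent, regardless of the command the caller requests.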
Some notes about custom checks and plugins
This section represents my understanding and may contain errors.
A check:
- discovers and monitors (i.e. reports if status is good or bad) one (e.g. memory consumption) or more services (e.g. how full is the root filesystem, how full is the /usr filesystem, how full is the /tmp filesystem, etc)
- is typically just one file written in any language
- must produce one line per detected service with each line having the format:
<status> "<service-name>" {<metrics>|-} <status-detail>
(Note that the output mentions service names, which, even though they are all detected by the same check, are all different.)
- Checks need to be installed on all clients (and remember that the server itself is also a client) in /usr/lib/check_mk_agent/local
- Installation of the checks on all clients can be done by the bakery facility (Checkmk Enterprise only) or by any other method (e.g. scp, ansible, etc).
- The Checkmk agent (/usr/bin/check_mk_agent) will automatically reformat the output to match the format produced by plugins.
- Checks that other people have written are available on The Monitoring Plugins Project
- More details about checks can be found here.
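To illustrate the output format described above, here is a minimal sketch of a check that reports how full a filesystem is. The service name, metric name and the 80/90 thresholds are made up for illustration, and the helper functions are hypothetical:

```shell
#!/bin/sh
# Sketch of a check to be placed in /usr/lib/check_mk_agent/local.
# Service name, metric name and thresholds are illustrative assumptions.

# Map a percentage ($1) to a status using warn ($2) and crit ($3) thresholds.
status_for_percent() {
    if [ "$1" -ge "$3" ]; then
        echo 2
    elif [ "$1" -ge "$2" ]; then
        echo 1
    else
        echo 0
    fi
}

# Print one line in the format:
#   <status> "<service-name>" <metrics> <status-detail>
emit_check_line() {
    printf '%s "%s" %s %s\n' "$1" "$2" "$3" "$4"
}

# Report how full the filesystem mounted at $1 is.
check_filesystem() {
    used=$(df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }')
    emit_check_line "$(status_for_percent "$used" 80 90)" \
        "Filesystem fill $1" "used_percent=$used;80;90" "$used% used"
}

# Example: calling this once per filesystem yields one service per line:
#   check_filesystem /
#   check_filesystem /tmp
```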
A plugin:
- discovers and displays raw data (i.e. it does not report if status is good or bad) about one or more services
- is typically several files: a data collector (written in any language), a server-side script to register the check and parse the data (written in Python) and optionally a metrics (i.e. graphs) config file, a perf-o-meter config file, a WATO config file (all written in Python)
- is at least one file (the data collector) but may include more (service rules for use from Checkmk’s web interface, maybe a Perf-o-Meter)
- the data collector must produce output of the format:
<<<<check-name>>>>
<services-data-spanning-multiple-lines-in-any-format>
(That first line is <check-name> enclosed in three angle brackets.)
- Data collectors need to be installed on all clients (and remember that the server itself is also a client) in /usr/lib/check_mk_agent/plugins; the other components need to be installed under /opt/omd/sites/<sitename>
- Installation of the data collectors on all clients can be done by the bakery facility (Checkmk Enterprise only) or by any other method (e.g. scp, ansible, etc)
- Plugins that other people have written are available on Checkmk Exchange in .mkp format
- the mkp command can unpack .mkp files and install the server-side components in the right place but the client-side data collector still needs to be installed on all the clients in /usr/lib/check_mk_agent/plugins
- More details about plugins can be found here.
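A minimal sketch of a data collector, assuming a hypothetical section name my_uptime. The server-side Python code that would parse this section is not shown:

```shell
#!/bin/sh
# Sketch of a data collector to be placed in /usr/lib/check_mk_agent/plugins.
# The section name "my_uptime" is a made-up example.

# Print the section header: the check name enclosed in three angle brackets.
print_section_header() {
    printf '<<<%s>>>\n' "$1"
}

# Print the section: header first, then raw data in a format of our choosing
# (here: the first field of /proc/uptime, i.e. seconds since boot).
collect_uptime() {
    print_section_header my_uptime
    cut -d' ' -f1 /proc/uptime
}

# Example: the last line of the plugin script would simply be:
#   collect_uptime
```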
As an exercise:
- install the ‘hello world’ plugin from the Checkmk Exchange as documented below
- wait until you have seen alerts for all hosts
- to disable the check while leaving it installed:
- Go to Setup–>Services–>Service discovery Rules–>Disabled services.
- Create a new rule in the main directory with:
- Services: Hello World
- Apply the change.
Special: using custom client-side service checks
- Write the script; it should print one line per service in the check output format described above, with the status in the first column: 0 for OK, 1 for warning, 2 for critical and 3 for unknown (the same values as in the Nagios Plugin API).
- Install the script in /usr/lib/check_mk_agent/local on all clients; Alexis should instead complete ‘Special: installing my monitoring module’ below.
Special: using custom server-side service checks
- Write the script; it should return 0 on success, 1 for warning, 2 for critical and 3 for unknown (see the Nagios Plugin API docs for more details).
- Install the script in /omd/sites/<sitename>/local/lib/nagios/plugins; Alexis should instead complete ‘Special: installing my monitoring module’ below.
- Go to Setup–>Services–>Other services–>Integrate Nagios plugins.
- Create a new rule in the main directory with:
- Description (e.g. “Check access to www.pasta.freemyip.com”)
- Service description (e.g. “website-wordpress-alexis”)
- Command line (e.g. “check-with-multiple-proxies -i $SERVICEDESC$ -u https://$HOSTNAME$/ -s Gardening”)
- Explicit hosts (e.g. “www.pasta.freemyip.com”)
- Apply the change.
- To allow several services to be grouped together for the purpose of applying a rule to the group (e.g. allow these services to fail a few more times than normal before sending a notification) do the following:
- Note the service descriptions
- Go to: Setup–>Services–>Service monitoring rules–>Service labels (see above for why we do not use service tags).
- Create a new rule in the main directory with:
- Label: <some-label> (e.g. “unreliable-check:yes”)
- Services: <list-of-services> (e.g. “website-wordpress-alexis, website-wordpress-judith, website-wordpress-suzie, website-openproject, website-debian, website-sources, website-redhat, website-svn-main”)
- Save the change but don’t bother applying it just yet.
- See the section ‘Special: allow brief service failure’ below.
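A minimal sketch of a Nagios-style script of the kind installed above, assuming a simplified check that fetches a URL with curl and looks for a search string. It is not the check-with-multiple-proxies script referenced above, all function names are hypothetical, and the choice of WARNING for a missing search string is an assumption:

```shell
#!/bin/sh
# Sketch of a Nagios-style active check: one line of output plus an exit
# code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

# Decide the state from the page body ($1) and a search string ($2).
# Prints the status line and returns the matching exit code.
classify_page() {
    if [ -z "$1" ]; then
        echo "CRITICAL - no content received"
        return 2
    fi
    case "$1" in
        *"$2"*) echo "OK - found '$2'"; return 0 ;;
        *)      echo "WARNING - page loaded but '$2' not found"; return 1 ;;
    esac
}

# Fetch URL $1 and classify it against search string $2.
check_url() {
    if [ "$#" -ne 2 ]; then
        echo "UNKNOWN - usage: check_url <url> <search-string>"
        return 3
    fi
    body=$(curl -fsS --max-time 20 "$1" 2>/dev/null)
    classify_page "$body" "$2"
}

# Example, as the last lines of a plugin script:
#   check_url "https://www.pasta.freemyip.com/" Gardening
#   exit $?
```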
Special: installing plugins from Checkmk Exchange
- Visit the Checkmk Exchange, search or browse your way to the desired plugin and copy the download URL.
- On the Checkmk server run:
omd su <sitename>
wget -q <url>
mkp install <downloaded-file>
mkp list <plugin-name> | grep agents/
That last command shows the file that needs to be installed on all the clients.
- Install that file on all the clients; how you do this depends on your environment: if you have Checkmk Enterprise Edition then the bakery can do it; if you have Checkmk Raw Edition then you need to use scp or Ansible or some other distribution method. Remember: you want to install it into /usr/lib/check_mk_agent/plugins on all clients and remember the server is also a client; Alexis should instead complete ‘Special: installing my monitoring module’ below.
- Go to Setup–>Hosts–>Hosts (menu)–>Discover services.
- On the Mode menu select ‘Refresh all services (tabula rasa)’, click Start and wait for the scan to complete.
- Apply the changes.
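For the Raw Edition, the distribution step above can be sketched as a simple scp loop. This assumes root ssh access to every client; the function name and the DRY_RUN switch are hypothetical, and the host names in the example are placeholders:

```shell
#!/bin/sh
# Sketch: copy an agent plugin to several clients with scp.
# Assumptions: root ssh access to every client; DRY_RUN=1 only prints the
# commands instead of running them.

PLUGIN_DIR=/usr/lib/check_mk_agent/plugins

# distribute_plugin <plugin-file> <host>...
distribute_plugin() {
    file=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scp -p $file root@$host:$PLUGIN_DIR/"
        else
            # -p preserves the file mode, so the plugin stays executable.
            scp -p "$file" "root@$host:$PLUGIN_DIR/" || echo "failed: $host" >&2
        fi
    done
}

# Example (file as listed by "mkp list"; host names are placeholders):
#   DRY_RUN=1 distribute_plugin local/share/check_mk/agents/plugins/hello_world host1 host2
```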
Special: writing custom client-side service plugins
This is way too complicated to describe here fully.
- Make sure you have installed the hello_world plugin (see above).
- Note the important components of the hello_world plugin on the Checkmk server:
chifferi# omd su default
OMD[default]:~$ mkp list hello_world | sed "s@$(pwd)/@@" | egrep -v '(checkman)' | cat -n
1 local/lib/check_mk/base/plugins/agent_based/hello_world.py
2 local/share/check_mk/agents/plugins/hello_world
3 local/share/check_mk/web/plugins/metrics/helloworld_metric.py
4 local/share/check_mk/web/plugins/perfometer/helloworld_perfometer.py
5 local/share/check_mk/web/plugins/wato/helloworld_parameters.py
OMD[default]:~$
The five files are:
- define discover function, define monitor function (which compares raw data with thresholds and derives service status), register both functions with Checkmk server
- data collector program (to be copied to all clients, including the server)
- define graphs
- define perf-o-meters
- define check’s default thresholds, what is modifiable in WATO
- Write the data collector program first; the output format is your own choice because – later – you will write a Python function to parse it (I wrote ‘dummy’).
Special: installing my monitoring module
I store all my monitoring scripts in an SVN module, so I just need to check out the module and replace a few directories with symlinks as follows.
- If adding a plugin then add it in /usr/local/opt/monitoringtools/agent-plugins/bin and commit.
- As root on lagane run:
rocon -c 'screen -d -m pcms --no-make --no-update-os --no-report' AllHosts
(pcms will call /etc/pcms/site-config/plugins/install-pasta-monitoringtools, which will update /usr/local/opt/monitoringtools and ensure that /usr/lib/check_mk_agent/local, /usr/lib/check_mk_agent/plugins and /omd/sites/default/local/lib/nagios/plugins are all symlinks pointing into /usr/local/opt/monitoringtools.)
Special: Increasing the service check timeout
Per-service timeouts are provided by the Checkmk Micro Core, but the Micro Core is only available in the Checkmk Enterprise Editions. However, a global service timeout is configurable.
- Run:
omd su <site-name>
vi etc/nagios/nagios.cfg
- Add the line:
service_check_timeout=120
- Stop and start the site and exit the shell:
omd stop
omd start
exit
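The edit above can also be made non-interactively. A minimal sketch, assuming a hypothetical helper set_option that replaces an existing key=value line or appends one if the key is not yet present:

```shell
#!/bin/sh
# Hypothetical helper: set "key=value" in a simple key=value config file.
# set_option <file> <key> <value>
set_option() {
    if grep -q "^$2=" "$1"; then
        # Key already present: rewrite that line, then copy the result back
        # in place so the file keeps its owner and permissions.
        sed "s|^$2=.*|$2=$3|" "$1" > "$1.new" && cat "$1.new" > "$1" && rm -f "$1.new"
    else
        # Key not present: append it.
        printf '%s=%s\n' "$2" "$3" >> "$1"
    fi
}

# Example, run as the site user from the site's home directory:
#   set_option etc/nagios/nagios.cfg service_check_timeout 120
```

Running it twice leaves a single service_check_timeout line, which avoids duplicate entries when the procedure is repeated.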
Special: Handling transient hosts
The official documentation aims to simulate that transient hosts and their services are all always up. I prefer that transient hosts are marked down and their services are not reported while in that state. This is already the default, except that cercis is pingable even when down (as its IP gets used by somebody else). So I need to choose a different way to monitor whether it is up or down.
- Go to Setup–>Hosts–>Host monitoring rules–>Host Check Command.
- Create a new rule in the main directory with:
- Host Check Command: TCP Connect/<secret>
- Explicit hosts: cercis
- Apply the changes.
Special: monitoring websites my old way
This procedure is no longer to be used but is kept here for reference.
- Add the host but with:
- Checkmk agent: No API integrations, no Checkmk agent
- Networking segment: WAN
- Host tags: Entity type: Website (do the other taggroups too, but this one is particularly relevant)
- Go to Setup–>Services–>Other services–>Integrate Nagios plugins
- Create a new rule in the main directory with:
- Command:
check-http-via-multiple-proxies -i $SERVICEDESC$ -u https://$HOSTNAME$/ -s Gardening
(Adjust the search string to suit the particular website being monitored.)
- Explicit hostnames: <name-of-host>
- Note that when I replaced my front-end webserver, it became apparent that Checkmk was also checking the host. (It became apparent because Checkmk caches host IPs and, due to the new front-end webserver having a different IP address, …). The host itself should not be monitored. The best way to address this is to take the host status from a service status.
- Go to Setup–>Hosts–>Host monitoring rules–>Host check command.
- Create a new rule in the main directory with:
- Host Check Command: Always assume host to be up
- Host tags: Entity type: Website
- Apply the changes.
Special: monitoring websites my new way
I used to monitor my own websites by getting an online list of open proxies and then visiting my websites via a proxy. But the proxies were too unreliable so I started accessing my sites through multiple proxies and taking the best result. But that was still too unreliable! Now I use a cronjob on my desktop machine at work to poll the websites and submit results via ssh and Checkmk’s lq command.
The requirement to save each host or rule added is not explicitly mentioned in the procedure below, so remember to do it!
- For each website:
- Add the website as a host with:
- Go to Setup–>Hosts–>Add Host
- Hostname: <website-name>
- Checkmk agent: No API integrations, no Checkmk agent
- Networking segment: WAN
- Alexis/Entity type: Website
- Map the website name to a Nagios-style service name, which we’ll reference shortly, as follows:
- Go to Setup–>Services–>Other services–>Integrate Nagios plugins–>Create rule in folder: Main directory
- Description: “Check access to website <website-name>”
- Service description: a Nagios-style service name, for example: website-wordpress-alexis
- Command line: <leave-disabled!>
- Explicit hosts: <website-name>
- On a machine that can accurately check the website status and that has ssh access to the Checkmk server itself, set up a cronjob to check the website and submit the result using Checkmk’s lq command, as in this example:
HOST=www.pasta.freemyip.com
SERVICE=website-wordpress-alexis
# 0 is ok, 1 is warning, 2 is critical; this should be worked out, not hardcoded :-)
STATE=0
COMMENT="website not accessible or does not contain specified search string"
CHECKMK_SERVER=chifferi.pasta.net
CHECKMK_SITE=default
MESSAGE="COMMAND [$(date +%s)] PROCESS_SERVICE_CHECK_RESULT;$HOST;$SERVICE;$STATE;$COMMENT"
echo "lq \"$MESSAGE\"" | ssh "$CHECKMK_SERVER" "omd su $CHECKMK_SITE"
It is tempting to think that messages could be submitted only on state changes (that the script itself detects) but, firstly, monitoring state changes is Checkmk’s job and, secondly, without regular messages Checkmk will report the state as stale.
- We are not interested in monitoring whether these hosts (not the websites, which are services on the hosts) are up or not, so we declare them to be always up as follows:
- Go to Setup–>Hosts–>Host monitoring rules–>Host Check Command–>Create rule in folder: Main directory
- Description: Make all websites’ host states (not service states) always up
- Host Check Command: Always assume host to be up
- Host tags: Alexis/Entity type is Website
- Note that because no command is associated with the Nagios-style service name, there is no need to create a rule to disable active checks on the website host. (When I transitioned from calling my check-http-via-multiple-proxies command to this new way of doing things, I temporarily created an “Enable/disable active checks for services” rule to prevent the command being run, so that the state updates from my new script, running on my desktop at work, would not be overwritten.)
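The cron job example in this section hardcodes STATE. A minimal sketch of how the state could be worked out and the message assembled follows. Both function names are hypothetical, and treating a missing search string as critical (2) is an assumption:

```shell
#!/bin/sh
# Sketch: work out STATE instead of hardcoding it, then build the command
# string that is passed to lq. Function names are hypothetical.

# state_for_page <page-body> <search-string>: print 0 (ok) if the body
# contains the search string, else 2 (critical).
state_for_page() {
    case "$1" in
        *"$2"*) echo 0 ;;
        *)      echo 2 ;;
    esac
}

# build_message <timestamp> <host> <service> <state> <comment>
build_message() {
    printf 'COMMAND [%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example for the cron job (HOST, SERVICE and COMMENT as in the example in
# this section; "Gardening" is the search string used earlier on this page):
#   BODY=$(curl -fsS --max-time 20 "https://$HOST/" 2>/dev/null)
#   STATE=$(state_for_page "$BODY" Gardening)
#   MESSAGE=$(build_message "$(date +%s)" "$HOST" "$SERVICE" "$STATE" "$COMMENT")
```

An empty body (for example when the fetch fails) does not contain the search string, so it also yields state 2.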
Special: How to tell CheckMK that the current value for a service is correct when it thinks it’s wrong
The specific case I needed this for was to totally disable the shared-memory-to-total-memory-ratio check on containers; see here for more info.
- Go to Setup–>Services–>Service monitoring rules–>Memory and Swap usage on Linux.
- Create a new rule in the main directory with:
- Upper levels for Shared memory: Do not impose levels
- Host tags: is LXC container
- Apply the changes.
Special: Ping-only targets
Special: FritzBoxes
- To allow Checkmk to collect statistics, ensure UPnP is enabled on the FritzBox (Home Network–>Network–>Network Settings–>Transmit status information over UPnP: YES).
- Go to Setup–>Agents–>Other integrations–>Fritz!Box Devices.
- Create a new rule in the main directory with:
- Host tags: Entity Type is FritzBox (this relies on you having set up such a tag group and tag, as described above).
- Apply the changes.
Special: allow more threads on desktops and laptops
- Go to Setup–>Services–>Enforced Services–>Number of threads.
- Create a new rule in the main directory with:
- Check type: cpu_threads
- Warning at: 4000 (was 2000)
- Critical at: 5000 (was 4000)
- Host tags: has “laptop” profile
- Clone the rule and change:
- Host tags: has “desktop” profile
(A single rule is applied only if all of its tags match, not if any one of them matches, so we need to use multiple rules.)
- Apply the changes.
Special: calling a custom check multiple times for different services all on one host
Special: allow brief host failure
This is relevant for host cercis.
- Go to Setup–>Hosts–>Host monitoring Rules–>Maximum number of check attempts for host.
- Create a new rule in the main directory with:
- Maximum number of check attempts for host: 3
- Explicit host: cercis
Special: allow brief service failure
This is relevant for my services using the check-http-via-multiple-proxies check.
- Go to Setup–>Services–>Service monitoring rules–>Maximum number of check attempts for service.
- Create a new rule in the main directory with:
- Maximum number of check attempts for service: 3
- Service labels: has unreliable-check:yes
- Apply the changes.
Special: Distro LSB check warning
I have a check that checks the installed Linux distribution is the expected one. When a machine gets upgraded or a new host gets installed then it typically causes this alert. This procedure could be done on any machine, but doing it on the failing machine allows me to immediately check that the fix works, rather than waiting for PCMS to be invoked by cron.
- On the failing machine run:
cd /usr/local/opt/monitoringtools
svn up
./agent-checks/check-distro
and note the displayed error code in column #1 (0 is ok, 1 is warning, 2 is critical).
- If, even following the svn up, the check still fails then:
- Edit that script to correct the expected distribution.
- Rerun the script to check that the displayed error code is now 0.
- Commit the changes.
- Wait a few minutes; the warning should clear.
- Run the following command and verify the output is as shown:
nuvole# cd /etc/pcms/pcms-config
nuvole# ./plugins/svn-up-monitoringtools -d 10
svn-up-monitoringtools: DEBUG[10]: main: already at required patchlevel
nuvole#
- Edit that script and increment the value of PATCHLEVEL (close to the top of the file).
- Rerun the script and verify the output is as shown:
nuvole# ./plugins/svn-up-monitoringtools -d 10
svn-up-monitoringtools: DEBUG[10]: main: reapplying patch ...
nuvole#
- Commit the changes.
Special: ignore vnet* interfaces
- Go to Setup–>Services–>Discovery rules–>Disabled services.
- Create a new rule in the main directory with:
- Description: Don’t monitor vnet interfaces attached to QEMU VMs and LXC containers on VM servers
- Comment: They flap too much as VMs are migrated
- Explicit hosts: pici, ziti, testaroli, pestaroli
- Services: Interface vnet[0-9][0-9]*$
- Apply the changes.
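Before activating a rule like this it can help to try the service pattern against a few service names. A minimal sketch, assuming the pattern is matched from the start of the service name (emulated here by prefixing it with "^"); the function name is hypothetical:

```shell
#!/bin/sh
# Sketch: try a service pattern against sample service names.
# Assumption: the pattern is anchored at the start of the service name,
# which is emulated here by prefixing it with "^".

# matches_pattern <pattern> <service-name>: return 0 on match, 1 otherwise.
matches_pattern() {
    printf '%s\n' "$2" | grep -Eq "^$1"
}

# Example:
#   matches_pattern 'Interface vnet[0-9][0-9]*$' 'Interface vnet12' && echo "would be disabled"
```

With the pattern above, "Interface vnet0" and "Interface vnet12" match, while "Interface eth0" and a bare "Interface vnet" do not.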