Check MK
Checkmk is software written in Python and C++ for IT infrastructure monitoring. It is used to monitor servers, applications, networks, cloud infrastructures (public, private, hybrid), containers, storage, databases and environment sensors.[3]
Developer(s) | tribe29 GmbH (previously Mathias Kettner GmbH) |
---|---|
Initial release | 2008 |
Stable release | 1.6.0p16[1] / August 19, 2020 |
Repository | |
Written in | Python, C++ |
Operating system | Linux |
Type | IT Infrastructure Monitoring |
License | GNU GPL v2 and other Open Source licenses, Checkmk Enterprise License[2] |
Website | checkmk |
Checkmk is available in three editions:[4] an open source edition ("Checkmk Raw Edition – CRE"),[5] a commercial enterprise edition ("Checkmk Enterprise Edition – CEE") and a commercial edition for managed service providers ("Checkmk Managed Services Edition – CME"). These editions are available for a range of platforms, in particular various versions of Debian, Ubuntu, SLES and Red Hat / CentOS, and also as a Docker image.[6] In addition, physical appliances of various sizes and a virtual appliance are offered. They simplify the administration of the underlying operating system through a graphical user interface and enable high-availability solutions.
The agents used by Checkmk to collect data are available for 11 platforms, including Windows.[7]
History
Checkmk originated in 2008 as an agent-substituting shell script for inetd, and was published in April 2009 under the GPL.[8] It was initially based on Nagios and extended it with a number of new components.[9][10] The open source edition (Checkmk Raw Edition) continues to be based on the Nagios core and bundles it with additional open source components into a complete system.[11]
Over many years, Checkmk's commercial editions have evolved into a self-contained monitoring system that has replaced all essential Nagios components with its own, including its own monitoring core.[12] Most of the developments for the commercial editions, in particular all plug-ins, are also available in the Checkmk Raw Edition.
While in the past Checkmk was designed for monitoring large and heterogeneous on-premises environments, from version 1.5+ (1.5p12) it also supports the monitoring of AWS, Azure, Docker and Kubernetes services.[13]
Checkmk is developed by tribe29 GmbH[14] in Munich, Germany, which operated under the name Mathias Kettner GmbH until 16 April 2019. Together with the company name, the product name was changed from "Check_MK" to "Checkmk".
tribe29 GmbH follows an open core business model. The open source edition is available under different open source licenses – mostly GPLv2, while large parts of the commercial editions run under the proprietary "Checkmk Enterprise License".
The Product
Checkmk combines three types of IT monitoring:
- Status-based monitoring, which (via thresholds) records the "health" of a device or application.
- Metric-based monitoring that enables the recording and analysis of time series graphs. For the CEE both an HTML5-based graphing system and an integration with Grafana[15] are available.
- Log-based and event-based monitoring, in which key events can be filtered out and actions can be triggered based on these events.
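The status-based approach in the first item can be illustrated with a minimal sketch. It assumes simple upper thresholds and the conventional Nagios-compatible state codes; the function name and the example levels are hypothetical and not taken from Checkmk's code base.

```python
# Conventional monitoring states (Nagios-compatible numeric codes).
OK, WARN, CRIT = 0, 1, 2


def evaluate_levels(value, warn, crit):
    """Map a measured value to a state using upper warn/crit thresholds."""
    if value >= crit:
        return CRIT
    if value >= warn:
        return WARN
    return OK


# Hypothetical example: filesystem usage in percent with levels 80/90.
state = evaluate_levels(85.0, warn=80.0, crit=90.0)
print(state)
```

The metric-based approach would additionally store the measured value (here 85.0) as a time series point, independent of the resulting state.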
To provide broad monitoring coverage, Checkmk currently ships with more than 1,700 plug-ins in each edition, all of which are licensed under GPLv2. These plug-ins are maintained as part of the product and are regularly supplemented with additional plug-ins or extensions. Existing legacy Nagios plug-ins can be connected as well.
To simplify setup and operation, all components of Checkmk are delivered fully integrated. A rule-based 1:n configuration and a high degree of automation significantly accelerate workflows. This includes:
- Auto-discovery of hosts (where applicable)
- Auto-discovery of services[16]
- Automated configuration of plug-ins via preconfigured thresholds and rules
- Automated agent updates (a CEE feature)
- Automatic and dynamic configuration that enables the monitoring of volatile services with a lifespan of just a few seconds, such as in the Kubernetes environment (starting from CEE v1.6)[17]
- Automated discovery of tags and labels from sources such as Kubernetes, AWS and Azure (starting from CEE v1.6)
In addition, there are also playbooks for the use of configuration and deployment tools such as Ansible[18] or Salt.[19]
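The idea behind a rule-based 1:n configuration, where one rule sets parameters for many hosts, can be sketched as follows. This is an illustration only: the rule structure, the tag names and the first-match-wins evaluation are assumptions made for the example, not Checkmk's actual data model.

```python
def matches(rule, host):
    """A rule applies when all of its required tags are present on the host."""
    return rule["tags"] <= host["tags"]


def effective_levels(rules, host, default):
    """Return the value of the first matching rule, else the default."""
    for rule in rules:
        if matches(rule, host):
            return rule["value"]
    return default


# Hypothetical rules: (warn, crit) levels selected by host tags.
rules = [
    {"tags": {"prod", "database"}, "value": (70.0, 80.0)},
    {"tags": {"prod"}, "value": (80.0, 90.0)},
]
hosts = [
    {"name": "db01", "tags": {"prod", "database", "linux"}},
    {"name": "web01", "tags": {"prod", "linux"}},
    {"name": "test01", "tags": {"test"}},
]

for host in hosts:
    print(host["name"], effective_levels(rules, host, default=(90.0, 95.0)))
```

Two rules and one default cover all three hosts here; adding a fourth production host needs no new configuration, which is the point of the 1:n approach.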
Checkmk is often used in very large distributed environments in which a large number of sites (e.g. 300 locations at Faurecia[20]) and/or well over 100,000 devices (e.g. Edeka[21]) are monitored. This is possible in part because Checkmk's microcore consumes far fewer CPU resources than, for example, Nagios, and therefore delivers significantly higher performance on the same hardware. Furthermore, non-persistent data is held in RAM, which significantly improves access times.
Components
The Monitoring Core ("Checkmk Microcore - CMC")
The commercial editions of Checkmk use their own monitoring core, written in C++, which performs considerably better than the Nagios core. In addition, as of version 1.6 it can dynamically record objects with a short lifespan, such as containers. This is possible because, in contrast to the Nagios core, the Checkmk Microcore does not need to be restarted when the configuration changes. The open source version, the Checkmk Raw Edition, currently still uses the Nagios core.
Configuration & Check Engine
Checkmk provides its own service discovery and configuration generation, and it uses its own method for carrying out checks. During each check interval, each host is contacted only once, and the check results are submitted to the monitoring core as passive checks. This significantly reduces the load on the monitoring server as well as on the monitored hosts.
Checkmk uses different methods to access data on the target systems: agents installed on the target system; "special agents" that run on the monitoring server and communicate with the API of the target system; SNMP, for example for network devices and printers; and HTTP/TCP protocols for web and internet services. By default, Checkmk follows the "pull principle": the monitoring system explicitly queries the data, so it quickly notices when a system suddenly fails and no longer responds to a "pull". Alternatively, a "push" can be configured, in which the system transfers its data directly to Checkmk or to an intermediate host.
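The "contact each host once" principle can be sketched in a few lines: one pull returns the complete agent output, which is then split into sections for the individual checks. The sketch assumes the plain-text agent output with `<<<section>>>` headers and the commonly used agent port 6556; both the port constant and the sample data are assumptions for illustration, and the helper names are hypothetical.

```python
import socket

# Assumption: the agent listens on TCP port 6556 and sends plain text.
DEFAULT_AGENT_PORT = 6556


def fetch_agent_output(host, port=DEFAULT_AGENT_PORT, timeout=10.0):
    """Pull the complete agent output with a single connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_agent_output(text):
    """Split agent output into sections keyed by the <<<name>>> headers."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            # Drop optional header options after a colon.
            name = line[3:-3].split(":", 1)[0]
            current = sections.setdefault(name, [])
        elif current is not None:
            current.append(line.split())
    return sections


# Illustrative sample data, not captured from a real agent.
sample = """<<<uptime>>>
12345.67 54321.00
<<<mem>>>
MemTotal: 8000000 kB
MemFree: 2000000 kB
"""

sections = parse_agent_output(sample)
print(sorted(sections))
```

In a real pull, `parse_agent_output(fetch_agent_output("myhost"))` would replace the sample string; every check for that host then works on the parsed sections instead of opening its own connection.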
Data Interface ("Livestatus")
Livestatus is the main data interface in Checkmk. It provides live access to all data on the monitored hosts and services. The data is fetched directly from RAM, which avoids slow hard disk access and gives fast access to the information with little load on the system. Access uses a simple protocol and is possible from any programming language without a special library.
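As an illustration of how little is needed to talk to Livestatus, the following sketch builds a query in the Livestatus query language (`GET`, `Columns:`, `Filter:` and `OutputFormat:` headers, terminated by an empty line) and sends it over a Unix socket using only the standard library. The socket path shown in the usage note is a hypothetical placeholder and depends on the installation.

```python
import json
import socket


def build_query(table, columns, filters=()):
    """Assemble a Livestatus query string that requests JSON output."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {expr}" for expr in filters]
    lines.append("OutputFormat: json")
    # An empty line marks the end of the query.
    return "\n".join(lines) + "\n\n"


def query_livestatus(socket_path, query):
    """Send a query to the Livestatus Unix socket and decode the JSON reply."""
    chunks = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


# Query all hosts that are currently in state 1 (DOWN).
query = build_query("hosts", ["name", "state"], ["state = 1"])
print(query)
```

On a monitoring server, `query_livestatus("/path/to/site/tmp/run/live", query)` would return a list of rows such as `[name, state]` pairs; the path here is only a placeholder.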
Web-GUI ("Multisite")
Multisite is Checkmk’s web GUI. Besides loading pages quickly, it offers user-definable views and dashboards, distributed monitoring by integrating multiple monitoring instances via Livestatus, integration of NagVis, an integrated LDAP connection, access to status data via web services, and much more. Dashboards and views can be tailored to individual users or groups of users, for example vSphere-specific[22] views for VMware admins. The web GUI is currently available in German and English.
Web Administration ("WATO")
The Web Administration Tool (WATO) makes a Checkmk-based system completely administrable via the browser. This includes managing users, roles, groups, time periods, and more. Permissions can be granted in a granular way using a role concept, and existing role-based access controls (LDAP, AD) can be used for this. WATO is rule-based, so the configuration remains intuitive even in complex environments and requires little effort. Automatic discovery and configuration, as well as automatic agent updates, further accelerate the configuration process. An HTTP API can also be used to integrate CMDBs for accelerated configuration.
Alert System
Several notification channels can be set up and configured with different rules for each user. For example, emails can be sent at any time of day, while SMS notifications are sent only for important issues during on-call hours. Notifications can be addressed to all teams or to specific ones, e.g. only the storage admins are notified about a failed hard drive. Duplicate notifications are grouped together so that no user is notified twice through a particular channel. Furthermore, users can configure their own notifications. In distributed environments, alerts can be managed centrally. For detected issues, actions can be triggered automatically (alarm control) via scripts. Checkmk includes integrations with email and SMS gateways as well as with communication and IT service management solutions such as Slack, Jira, PagerDuty, OpsGenie, VictorOps and ServiceNow.
Business Intelligence
The BI module is integrated into the graphical user interface. Using rules, it aggregates the states of many individual hosts and services into the overall status of business processes, reflecting their dependency on complex applications and IT infrastructure elements. It can also be used to represent applications made up of microservices, which in turn consist of Kubernetes pods and deployments. In addition, worst-case scenarios can be simulated in real time and historical data can be analyzed to understand the causes of performance degradation.
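The aggregation idea can be sketched as a small tree in which each node combines the states of its children. The tree layout, the service names and the two aggregation functions below are assumptions chosen for illustration; they do not reproduce the BI module's actual rule format.

```python
OK, WARN, CRIT = 0, 1, 2


def worst(states):
    """The node is as bad as its worst child."""
    return max(states, default=OK)


def best(states):
    """The node is as good as its best child (e.g. redundant servers)."""
    return min(states, default=OK)


def aggregate(node, states):
    """Recursively compute the state of an aggregation tree."""
    if "service" in node:
        return states[node["service"]]
    return node["func"]([aggregate(child, states) for child in node["children"]])


# Hypothetical business process: a redundant web tier plus one database.
webshop = {
    "func": worst,
    "children": [
        {"func": best, "children": [
            {"service": "web01/HTTP"},
            {"service": "web02/HTTP"},
        ]},
        {"service": "db01/MySQL"},
    ],
}

current = {"web01/HTTP": CRIT, "web02/HTTP": OK, "db01/MySQL": WARN}
print(aggregate(webshop, current))

# "What if" simulation: assume the second web server fails as well.
what_if = {**current, "web02/HTTP": CRIT}
print(aggregate(webshop, what_if))
```

With the current states the redundant web tier masks the failed web01, so the database's WARN state determines the result; in the simulated scenario the whole process becomes CRIT.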
Event Console
The Event Console integrates the processing of log messages and SNMP traps into the monitoring. It is configured via a flexible set of rules that decide whether incoming messages are discarded or how they are classified. It can count, correlate, expect and rewrite messages, and more. Similar entries (e.g. multiple failed logins) can be grouped into a single event to keep the list of events manageable. It also has a built-in syslog daemon that receives messages directly on port 514, and an SNMP trap receiver that receives traps on port 162.
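A minimal sketch of such rule-based processing is shown below. It assumes that rules are checked in order, that the first matching rule decides, and that messages matching no rule are ignored; the rule fields and the sample messages are hypothetical.

```python
import re
from collections import Counter

# Hypothetical rule set: drop debug noise, classify failed logins.
RULES = [
    {"pattern": re.compile(r"DEBUG"), "action": "drop"},
    {
        "pattern": re.compile(r"Failed password for \S+"),
        "action": "event",
        "group": "failed-login",
        "state": "WARN",
    },
]


def process(messages, rules):
    """Classify messages and count similar ones as a single grouped event."""
    events = Counter()
    for message in messages:
        for rule in rules:
            if rule["pattern"].search(message):
                if rule["action"] == "event":
                    events[(rule["group"], rule["state"])] += 1
                break  # the first matching rule decides
    return events


messages = [
    "sshd[1]: Failed password for root from 10.0.0.1",
    "sshd[2]: Failed password for admin from 10.0.0.2",
    "app: DEBUG cache warmed",
    "sshd[3]: Failed password for root from 10.0.0.1",
    "kernel: link up",
]

for (group, state), count in process(messages, RULES).items():
    print(group, state, count)
```

The three failed logins end up as one grouped event with a counter instead of three separate entries.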
Metrics Graphing
The commercial Checkmk editions use their own metric and graphing system, which allows time series metrics to be analysed over long intervals using interactive HTML5 graphs. The maximum resolution is one second. Data can be imported from a variety of data sources and metric formats (JSON, XML, SNMP, etc.) and stored on disk for the long term.
Alternatively, Graphite or InfluxDB can be connected via an export interface. From CEE version 1.5p16 there is also a plug-in available for integrating data directly from Checkmk into Grafana for visualization purposes. The Checkmk Raw Edition currently uses PNP4Nagios as its graphing system.
Reporting
Reporting delivers PDF reports either ad hoc or automatically at regular intervals. It includes the availability analysis, which provides the history of states over any desired time period with a click. Availability calculations can exclude unmonitored times, adjust the resolution, or ignore short intervals. In addition to the availability calculations, reporting also includes SLA reporting, with which complex SLAs can be monitored. Reporting is only available in the commercial versions of Checkmk.
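How an availability figure can be derived from a state history is sketched below. The interpretation is an assumption made for this example: intervals with state `None` count as unmonitored and are excluded, and non-OK intervals shorter than `min_duration` seconds are treated as if the service had been OK.

```python
def availability(history, min_duration=0):
    """Return the availability in percent for (start, end, state) intervals.

    Timestamps are in seconds. Unmonitored intervals (state None) are
    excluded; non-OK intervals shorter than min_duration are ignored.
    """
    ok_time = 0
    considered = 0
    for start, end, state in history:
        duration = end - start
        if state is None:
            continue  # unmonitored time does not count at all
        considered += duration
        if state == "OK" or duration < min_duration:
            ok_time += duration
    if considered == 0:
        return None
    return 100.0 * ok_time / considered


# Hypothetical three-hour history with a 30-second glitch and a
# 30-minute unmonitored window.
history = [
    (0, 3600, "OK"),
    (3600, 3630, "CRIT"),
    (3630, 7200, "OK"),
    (7200, 9000, None),
    (9000, 10800, "CRIT"),
]

print(availability(history))
print(availability(history, min_duration=60))
```

Ignoring outages shorter than 60 seconds raises the figure slightly, because the 30-second glitch no longer counts against the service.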
Hardware/Software Inventory
The hardware/software inventory can be used, for example, to monitor hardware and software changes, to verify the presence of installed security updates, and to update static data with dynamic parameters (for example, updating the current disk usage statistics based on monitoring data). The Configuration Management Database (CMDB) i-doit has a deep integration that enables the exchange of CMDB data with monitoring data.
References
1. tribe29 GmbH (2020-08-19). "Checkmk stable release 1.6.0p16". Checkmk Announcement.
2. "Checkmk EULA" (PDF). tribe29 GmbH. Retrieved 2019-05-31.
3. "Use Cases". tribe29 GmbH. Retrieved 2019-06-15.
4. "Checkmk Editions". tribe29 GmbH. Retrieved 2015-11-27.
5. "Open Source IT monitoring with Checkmk". tribe29 GmbH. Retrieved 2019-07-01.
6. "Download version". tribe29 GmbH. Retrieved 2019-07-10.
7. "Monitoring Agents". tribe29 GmbH. Retrieved 2019-06-12.
8. "Mathias Kettner (check_mk)". Meet The Community. Nagios Enterprises. 2009-08-17. Archived from the original on 2012-01-06. Retrieved 2015-11-27.
9. Rieger, Götz (2012-11-03). "Einfach mal Nagios – Netzwerk-Monitoring mit OMD und Check_MK" (in German). c’t. p. 190. Retrieved 2015-11-27.
10. Huber, Mathias (2011-03-09). "Nagios-Erweiterung Check_mk in Version 1.1.10" (in German). Linux Magazine. Retrieved 2015-11-27.
11. Siering, Peter (2017-05-31). "Monitoring-System Check_MK in frischer Version 1.4.0" (in German). Heise Online. Retrieved 2017-05-31.
12. Kettner, Mathias. "The Checkmk micro core (CMC)". Retrieved 2018-12-05.
13. "Checkmk community announcement 1.5 Plus(1.5.p12)". tribe29 GmbH. 2019-02-17. Retrieved 2019-07-11.
14. "tribe29 - Our Story". tribe29 GmbH. Retrieved 2019-06-14.
15. Mueller, Christian (2019-04-17). "Grafana Data Source Plugin". GitHub. Retrieved 2019-07-09.
16. "Automatic Service Discovery". tribe29 GmbH. Retrieved 2017-02-17.
17. "Monitoring of highly dynamic environments". tribe29 GmbH. Retrieved 2019-05-07.
18. "Ansible integration with Checkmk". GitHub. 2019-05-01. Retrieved 2019-05-08.
19. "Salt integration with Checkmk". GitHub. 2019-05-02. Retrieved 2019-05-09.
20. "Global deployment of Check_MK at Faurecia". 2018-10-23. Retrieved 2018-10-23.
21. "EDEKA Vortrag" (in German). 2017-05-12. Retrieved 2017-05-12.
22. Jurzik, Heike; Arentz, Marcel (2019-07-01). "vSphere-Monitoring mit Checkmk" (in German). Linux-Magazin. Retrieved 2019-07-02.
External links
- Official website
- Computer monitoring with the Open Monitoring Distribution (Kelvin Vanderlip, 2012-03-01)
- Using the Open Monitoring Distribution(Nagios) to Monitor Complex Hardware/Software Systems (Joe VanAndel, 2012-03-29)