How to monitor Docker?

UPDATE: Part 2 of the blog can be found here

Docker has shot to fame in the last couple of years and has attracted quite a lot of attention. It has joined the elite “buzz word” club and is being touted as revolutionary, but if you asked the more seasoned UNIX admins, they would argue that containerization is nothing new. Whilst that may be true, Docker has made containerization accessible to a wide audience and is building a rather active and growing community around it. Google Trends shows us the search trends for Docker over the last few years:

In this blog, I want to concentrate on sharing my findings on how you could monitor Docker. I will primarily be looking into self-hosted options (I will explain why), and at how these tools alert on and visualize the Docker performance data.

I will also quickly touch on some interesting tools I came across during my quest; these don’t necessarily satisfy all of the monitoring requirements but are noteworthy.

Why not attend our Docker Fundamentals course?

Datadog

Why not to SaaS…

If you are looking for a SaaS monitoring solution then Datadog is worth looking into. It is very simple to implement: you run a container within your Docker host, which is literally a single command. I have looked into other SaaS alternatives and nothing comes close to the ease of implementation or the level of information you get from this service.
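At the time of writing, installing the agent was roughly the one-liner below. Treat this as a sketch: the image name, mounts and API key variable are the ones Datadog documented at the time, so follow their current setup instructions rather than this snippet.

docker run -d --name dd-agent \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -e API_KEY=YOUR_DATADOG_API_KEY \
  datadog/docker-dd-agent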

Datadog Docker Dashboard

However, as with all good things in life, it does come at a cost. At the time of writing this blog it was $180/host/year (billed annually) or $18/host/month (billed monthly). So, if you have a small environment and these numbers don’t trouble you, then this is the easiest way to get Docker monitored (albeit siloed off from any monitoring solution you already use). For a larger estate, these numbers can quickly add up to some expensive bills, so it’s worth looking at self-hosted options.

cAdvisor

The first of the self-hosted options I came across was ‘cAdvisor‘. It’s a ‘metric explorer’ that shows the real-time performance of the Docker environment. It’s quite simple to launch; simply run the command below to start it as a container:

docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest

The above command exposes the front end on port 8080, so visiting http://localhost:8080 will take you to cAdvisor.

cAdvisor

It has every conceivable metric about the Docker host and its containers, neatly graphed in a single interface. However, this tool has no ability to alert, and it only keeps a short history (60 seconds by default; this can be increased in the settings, but long-term retention needs an external database such as InfluxDB), so it’s not really a monitoring tool. It’s definitely worth checking out for a quick insight into the Docker world, though.

cAdvisor does expose these metrics through its own API, and I noticed more than one monitoring tool using it to pull data into their systems to alert on. At first this seemed like a long way around getting the data into your monitoring environment, and it is extra workload on your Docker host that could be avoided by using the already exposed Docker API or CLI; however, the ease of grabbing all the metrics from a single source does make it appealing.
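If you want to pull the raw numbers yourself, the REST API is easy to query with cURL. The paths below are from cAdvisor’s v1.3 API; adjust the version to match your build.

# all monitored cgroups/containers, with their latest resource samples
curl http://localhost:8080/api/v1.3/containers/

# a single Docker container, addressed by name or ID
curl http://localhost:8080/api/v1.3/docker/cadvisor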

Sysdig

The next tool I looked into was Sysdig. Whilst this isn’t strictly meant for Docker or performance monitoring, it does expose some Docker performance stats and is meant to be all of the UNIX debugging tools rolled into one. Their website states:

Think of sysdig as strace + tcpdump + htop + iftop + lsof + awesome sauce.

Which, to be fair, it is…

To install this on the host, just run the below command,

curl -s https://s3.amazonaws.com/download.draios.com/stable/install-sysdig | sudo bash

You can then start the Sysdig interface by running below command,

csysdig

You start off with a ‘top’-style process viewer, but the interface responds to mouse clicks! You can view container-level details by clicking ‘Views’ (or pressing F2), which opens a menu, and then navigating to ‘Containers’.

csysdig Containers view
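If you prefer to stay on the command line, sysdig itself can scope its output to a single container. The invocations below are illustrative (the filter field and chisel names are as documented for sysdig, so double-check them against your version):

# show the busiest processes inside the cadvisor container (topprocs_cpu is a bundled chisel)
sudo sysdig -c topprocs_cpu container.name=cadvisor

# trace all activity from that container, annotating each event with its container context (-pc)
sudo sysdig -pc container.name=cadvisor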

And whilst this does provide insight into containers, it’s not something we can use to alert on; like cAdvisor, it is really only a ‘metric explorer’ when it comes to Docker-related stats. I don’t want to deep-dive into Sysdig’s other capabilities here; needless to say it’s quite a useful debugging tool and one that will be on all my servers in the future.

Prometheus

Next, I wandered into Prometheus. It looked promising as it seemed to do what I was looking for… it trends and alerts! Prometheus is pull-based monitoring: the Prometheus server scrapes metrics from ‘exporters’, small services that expose the data over HTTP.

Prometheus publishes its exporters as containers as well; you can find them on the Docker Hub under the ‘prom’ handle.

However, ‘container-exporter’ is now deprecated and the GitHub project advises integrating with the aforementioned cAdvisor instead.

I launched the Prometheus server by running the below:

docker run -d --name prometheus-server -p 9090:9090 prom/prometheus

I then logged on to the prometheus-server container by running:

docker exec -it prometheus-server /bin/sh

I then edited the /etc/prometheus/prometheus.yml file to include the cAdvisor container; in the target_groups > targets section, add the below (obviously replacing the CADVISOR_CONTAINER_IP and CADVISOR_CONTAINER_PORT placeholders with the actual IP address and port respectively):

 target_groups:
 - targets: ['localhost:9090','CADVISOR_CONTAINER_IP:CADVISOR_CONTAINER_PORT']
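For context, the relevant part of my prometheus.yml ended up looking roughly like the below. The surrounding keys come from the stock config shipped in the prom/prometheus image of that era; note that target_groups was later renamed to static_configs in newer Prometheus releases.

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    target_groups:
      - targets: ['localhost:9090','CADVISOR_CONTAINER_IP:CADVISOR_CONTAINER_PORT']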

I then restarted the ‘prometheus-server’ container, and visiting http://host_ip:9090 exposes the GUI below:

Prometheus

The drop-down menu has a few ‘container’-related stats. Prometheus recommends using Grafana or PromDash as the front end to visualize the data. I quite liked the idea of using Grafana as it accepts various data source types and isn’t restricted to Prometheus, so I opted for that.

You can run Grafana by running the below command,

docker run -d -p 3000:3000 \
    -v /var/lib/grafana:/var/lib/grafana \
    -e "GF_SECURITY_ADMIN_PASSWORD=secret" \
    grafana/grafana

If you are used to Grafana then this should be trivial; however, it can take some time to build these metrics into a fully-fledged dashboard. Luckily there is a ton of documentation for Grafana.

You can add Prometheus as a data source in Grafana. Once that is done, I built my queries in Prometheus first, as its expression console let me see the output and fine-tune / filter out the noise before running them in Grafana. My dashboard ended up looking something like below:

Grafana + Prometheus
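A couple of the expressions behind those panels, roughly. The metric names are the ones cAdvisor exports (the same container_memory_usage_bytes metric is used in the alert rule further down), and the name label carries the Docker container name:

# per-container CPU usage, expressed as a rate over the last minute
rate(container_cpu_usage_seconds_total{name=~".+"}[1m])

# per-container memory usage in bytes
container_memory_usage_bytes{name=~".+"}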

Alerting was my next order of business and below is an excerpt from their docs:

“Alerting with Prometheus is separated into two parts. Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation and sending out notifications via methods such as email, PagerDuty and HipChat.”

To start AlertManager as a container, run the below:

docker run -d -p 9093:9093 --name=prometheus_alertmanager prom/alertmanager

Next, we need to edit the /etc/alertmanager/config.yml file to set a notification receiver (I used email; however, you can configure PagerDuty, OpsGenie, Slack or a webhook for any other third party). Log in to the Alertmanager container to update this file:

docker exec -i -t prometheus_alertmanager /bin/sh

I updated the SMTP server host to a MailHog address to capture the emails without actually sending them. I kept everything else at the defaults, as there were enough routes there to try…
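Stripped down to the bits I actually changed, config.yml looked roughly like the below. This is a minimal sketch; the stock file shipped in the prom/alertmanager image carries more elaborate routes and receivers.

global:
  smtp_smarthost: 'MAILHOG_CONTAINER_IP:1025'   # MailHog listens for SMTP on port 1025 by default
  smtp_from: 'alertmanager@example.org'
  smtp_require_tls: false

route:
  receiver: 'team-X-mails'

receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'team-X+alerts@example.org'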

The next step is to create the alert rules. The path to the rules file needs to be added to the Prometheus server config (/etc/prometheus/prometheus.yml); just add the below to the rule_files section:

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "/first.rules"

What I also found was that I needed to launch the Prometheus server with a flag pointing to the Alertmanager IP address and port.

The alert rule file I created was called ‘first.rules’; on the Docker host I stored it under /root/first.rules with the below content:

ALERT HighMemoryAlert
  IF container_memory_usage_bytes > 100
  FOR 1m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "High Memory usage for container",
    description = "High Memory usage for container on {{$labels.instance}} for container {{$labels.name}} (current value: {{$value}})",
  }

I then launched the Prometheus server, mapping in the above rules file and pointing it at Alertmanager using the ‘-alertmanager.url’ flag, as seen below:

docker run -d --name prometheus-server -p 9090:9090 -v /root/first.rules:/first.rules prom/prometheus \
  -alertmanager.url=http://172.17.0.6:9093 -config.file=/etc/prometheus/prometheus.yml

And finally, logging back into the Prometheus server, under the ‘Alerts’ tab you should see the alert we defined in first.rules, with some alerts firing (I set the threshold low enough to make sure they would):

Prometheus Alerts tab

This is also forwarded to Alertmanager; go to http://ALERTMANAGER_IP:9093 and click on the ‘Alerts’ tab. Mine looked like below:

Alertmanager Alerts tab

Here you can see that the default routing logic defined in the Alertmanager config.yml sends it to team-X-mails with the summary and description we defined in the first.rules file. Obviously, this could do with a lot more fine-tuning, but it proves that we can alert…


In-house monitoring / Nagios XI

Unless you were already using one of the above tools for other monitoring requirements, using it to monitor Docker would essentially mean it is siloed off on its own: you have yet another monitoring tool to manage, and you deviate from the holistic monitoring approach that would help expose issues surrounding Docker.

Besides, unless your enterprise runs on nothing but Docker, you need a monitoring tool that can preferably do a bit more than just monitor Docker. Prometheus does do more than just Docker, but nowhere near what Nagios-esque tools are capable of. This is what drove me to find out how I can get Docker performance data into the monitoring tool of my choice, Nagios XI, which is a front end to Nagios Core but, as it comes from Nagios themselves, is tightly integrated. If you haven’t been following XI updates recently, you should know that XI exposes an API into Nagios, and this makes programmatically adding hosts and services into Nagios a more appealing and manageable option. Although these next steps are in the context of Nagios, the concepts explored can be applied to other monitoring solutions.

Docker has a Remote API that allows performance data to be gathered. Note that this isn’t the only way to go about it: Docker containers are similar to LXC containers and hence rely on Linux control groups, so the metrics can also be retrieved from pseudo-files on the host system. On my CentOS 7 server, under /sys/fs/cgroup, I had the below structure:

├── cgroup
│   ├── blkio
│   ├── cpu -> cpu,cpuacct
│   ├── cpuacct -> cpu,cpuacct
│   ├── cpu,cpuacct
│   ├── cpuset
│   ├── devices
│   ├── freezer
│   ├── hugetlb
│   ├── memory
│   ├── net_cls
│   ├── perf_event
│   └── systemd

For example, one of my running containers (with Container ID: 592e073c1ef4631c84d05e3e8f8da130c3037f2701bb7862951d79798cf9000a) would have its memory stats living under:

/sys/fs/cgroup/memory/docker/592e073c1ef4631c84d05e3e8f8da130c3037f2701bb7862951d79798cf9000a

with a structure like below:

.
├── cgroup.clone_children
├── cgroup.event_control
├── cgroup.procs
├── memory.failcnt
├── memory.force_empty
├── memory.kmem.failcnt
├── memory.kmem.limit_in_bytes
├── memory.kmem.max_usage_in_bytes
├── memory.kmem.slabinfo
├── memory.kmem.tcp.failcnt
├── memory.kmem.tcp.limit_in_bytes
├── memory.kmem.tcp.max_usage_in_bytes
├── memory.kmem.tcp.usage_in_bytes
├── memory.kmem.usage_in_bytes
├── memory.limit_in_bytes
├── memory.max_usage_in_bytes
├── memory.memsw.failcnt
├── memory.memsw.limit_in_bytes
├── memory.memsw.max_usage_in_bytes
├── memory.memsw.usage_in_bytes
├── memory.move_charge_at_immigrate
├── memory.numa_stat
├── memory.oom_control
├── memory.pressure_level
├── memory.soft_limit_in_bytes
├── memory.stat
├── memory.swappiness
├── memory.usage_in_bytes
├── memory.use_hierarchy
├── notify_on_release
└── tasks

There is also the Docker CLI (the “docker stats” command, for instance) to retrieve some crucial stats. Between these three methods, we should have all the performance data we need. I am leaning towards using the API and CLI as much as I can: although the pseudo-files are theoretically much quicker to query, they are a local resource and force me to run the script on the Docker host itself… which I want to avoid if possible.
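For a quick look from the CLI, for example (the --no-stream flag, available on reasonably recent Docker versions, returns a single snapshot instead of a continuously updating view, which is handy from scripts):

# live CPU, memory, network and block-IO figures for the named containers
docker stats prometheus-server cadvisor

# one-shot snapshot of all running containers
docker stats --no-stream $(docker ps -q)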

By default the Docker daemon listens on a UNIX socket; you can expose this on a TCP port by editing the Docker service that starts the daemon. I had my setup running on CentOS 7, where the version of cURL was below 7.40 and so doesn’t support UNIX sockets (this seems to be an issue on CentOS 7; on Ubuntu you should be able to use cURL with UNIX sockets, and you can also use netcat to get around it). In the interest of keeping my GET calls simple, and to be able to query this URI remotely, I decided to expose the TCP port instead. The Docker service definition lives in the below file:

/usr/lib/systemd/system/docker.service (CentOS 7)
/lib/systemd/system/docker.service (Ubuntu 15.10)

Edit the ExecStart variable as below,

ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2375 -H fd://

Here the first “-H” argument exposes the TCP port and the second exposes the UNIX socket.

Restart Docker,

service docker restart

You can now do a quick cURL command as below,

curl "http://127.0.0.1:2375/containers/json"

This will give you the details of all containers present, in JSON format; a snippet of which I have included below:

[
    {
        "Image": "google/cadvisor:latest",
        "ImageID": "sha256:4bc3588563b107ed7f755ddbbdf09cccdb19243671d512810da3ed0fef6f7581",
        "Command": "/usr/bin/cadvisor -logtostderr",
        "Created": 1463848238,
        "Ports": [
            {
                "IP": "0.0.0.0",
                "PrivatePort": 8080,
                "PublicPort": 8080,
                "Type": "tcp"
            }
        ],
        "Labels": {},
        "State": "running",
        "Status": "Up 23 hours",
        "HostConfig": {
            "NetworkMode": "default"
        },
        "NetworkSettings": {
            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "",
                    "EndpointID": "ecbca5db9df84d588a78d6a87fccfa4e491e8aed0d5bdc803a018c0184c0e75e",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:ac:11:00:02"
                }
            }
        },
        "Mounts": [
            {
                "Source": "/",
                "Destination": "/rootfs",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Source": "/sys",
                "Destination": "/sys",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Source": "/var/lib/docker",
                "Destination": "/var/lib/docker",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Source": "/var/run",
                "Destination": "/var/run",
                "Mode": "rw",
                "RW": true,
                "Propagation": "rprivate"
            }
        ]
    }
]

N.B.: The raw output is quite difficult to read; use an online viewer like http://codebeautify.org/jsonviewer# to make things easier to read.
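Per-container resource metrics are available from a similar endpoint. The stream=0 parameter (per the Remote API docs) asks for a single sample instead of a continuous stream; CONTAINER_ID below is a placeholder for a real container ID:

curl "http://127.0.0.1:2375/containers/CONTAINER_ID/stats?stream=0"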

Straight away you can see a wealth of information, and you can use a language of your choice to dig through it. I used Perl to parse the JSON and reshape it into what I wanted:

# ./docker_containers.pl 
Name|Image|State|Status|IP|Gateway
/prometheus_alertmanager|prom/alertmanager|running|Up Less than a second|172.17.0.6|172.17.0.1
/pedantic_ramanujan|prom/container-exporter|running|Up Less than a second|172.17.0.2|172.17.0.1
/stupefied_lalande|centos:centos7|running|Up Less than a second|172.17.0.4|172.17.0.1
/cadvisor|google/cadvisor:latest|running|Up Less than a second|172.17.0.5|172.17.0.1
/cranky_torvalds|mailhog/mailhog|exited|Exited (2) 2 seconds ago||
/prometheus-server|prom/prometheus:0.19.1|running|Up 3 minutes|172.17.0.3|172.17.0.1
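For reference, a stripped-down version of such a script might look something like the sketch below. It assumes the LWP::UserAgent and JSON CPAN modules are installed; my actual docker_containers.pl does a little more.

#!/usr/bin/perl
# Sketch: list containers from the Docker Remote API in Name|Image|State|Status|IP|Gateway form.
# Usage: docker_containers.pl DOCKER_HOST PORT
use strict;
use warnings;
use LWP::UserAgent;
use JSON;

my ($host, $port) = @ARGV;
my $res = LWP::UserAgent->new(timeout => 10)
    ->get("http://$host:$port/containers/json?all=1");
die "Failed to query Docker API: " . $res->status_line . "\n" unless $res->is_success;

print "Name|Image|State|Status|IP|Gateway\n";
for my $c (@{ decode_json($res->decoded_content) }) {
    my $net = $c->{NetworkSettings}{Networks}{bridge} || {};   # exited containers have no bridge entry
    print join('|', $c->{Names}[0], $c->{Image}, $c->{State}, $c->{Status},
               $net->{IPAddress} || '', $net->{Gateway} || ''), "\n";
}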

Obviously, this isn’t going to cut it for my monitoring project, but it got me started playing with the JSON data. The idea is to run the script against a Docker host, have it detect the running containers, and then create an entry in Nagios XI for the host and its related services (the containers) and their statuses.

With this in mind, I need the following in place:

  1. A script to create a Nagios host and services based on the running containers.
  2. The dynamically created check service for each container then keeps track of its status by regular polling.

If you don’t have a Nagios XI instance running, I created a Nagios XI container that you could use. Just run the below:

docker run --privileged -d -p 80:80 -p 443:443 logix88/nagiosxi /usr/sbin/init

Go through the first steps to initialize your instance and, once you are logged in, click on the ‘Help’ tab, which explains the API in a bit more detail. The examples actually include the IP address and API key of your instance, so just copy/paste and you are good to go.

First, we need a check command in Nagios for a simple check of whether a container is running. I wrote mine to produce the following outputs:

A stopped container would return a ‘Critical’ state:

# ./check_docker.pl 192.168.1.10 2375 f664b958126a
CRITICAL - loving_aryabhata exited with exit code: 137

A running container would return an ‘OK’ state:

# ./check_docker.pl 192.168.1.10 2375 7a04f6a837fa
OK - prometheus_alertmanager IP: 172.17.0.3 is running
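For illustration, a minimal version of such a plugin could be sketched as below, again assuming the LWP::UserAgent and JSON modules. The exit codes 0/2/3 map to the Nagios OK/CRITICAL/UNKNOWN states; my actual check_docker.pl handles a few more cases.

#!/usr/bin/perl
# Sketch of a Nagios-style container check.
# Usage: check_docker.pl DOCKER_HOST PORT CONTAINER_ID
use strict;
use warnings;
use LWP::UserAgent;
use JSON;

my ($host, $port, $id) = @ARGV;
my $res = LWP::UserAgent->new(timeout => 10)
    ->get("http://$host:$port/containers/$id/json");
unless ($res->is_success) {
    print "UNKNOWN - could not query Docker API: " . $res->status_line . "\n";
    exit 3;
}

my $c = decode_json($res->decoded_content);
(my $name = $c->{Name}) =~ s{^/}{};          # container names come back with a leading slash
if ($c->{State}{Running}) {
    my $ip = $c->{NetworkSettings}{IPAddress} || 'unknown';
    print "OK - $name IP: $ip is running\n";
    exit 0;
}
print "CRITICAL - $name exited with exit code: $c->{State}{ExitCode}\n";
exit 2;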

Next, we need a script to dynamically create the Docker host in Nagios and its containers as services under that host. Each of these services will use the above check_docker command to check the container’s current running state.

To add a new host into Nagios XI, we need these fields and sample values:

Required Parameter            Sample Value
host_name                     dockerhost
address                       192.168.1.10
max_check_attempts            2
check_period                  24x7
contacts or contact_groups    nagiosadmin
notification_interval         5
notification_period           24x7

You make the following POST call, as shown below:

curl -XPOST "http://nagiosxi-host/nagiosxi/api/v1/config/host?apikey=9akpavnh&pretty=1"
-d "host_name=dockerhost&address=192.168.1.10&check_command=check_ping\!3000,80%\!5000,100%
&max_check_attempts=2&check_period=24x7&contacts=nagiosadmin&notification_interval=5
&notification_period=24x7&applyconfig=1"

Similarly, for adding a service we need the below parameters:

Required Parameter            Sample Value
host_name                     dockerhost
service_description           $container_name
check_command                 check_docker!$docker_port!$container_id
max_check_attempts            5
check_interval                5
retry_interval                2
check_period                  24x7
notification_interval         5
notification_period           24x7
contacts or contact_groups    nagiosadmin

The POST call for this would look like below,

curl -XPOST "http://nagiosxi.demos.nagios.com/nagiosxi/api/v1/config/service?apikey=9akpavnh&pretty=1" 
-d "host_name=dockerhost&service_description=$container_name&check_command=check_docker\!$docker_port\
!$container_id&check_interval=5&retry_interval=5&max_check_attempts=2&check_period=24x7
&contacts=nagiosadmin&notification_interval=5&notification_period=24x7&applyconfig=1"

The script I wrote dynamically creates the container check services and produces output similar to the below,

# ./docker_create.pl 192.168.1.10 2375
Successfully added dockerhost to the system. Config imported but not yet applied.
Container Name: prometheus_alertmanager Id: 7a04f6a837fac26d500fe86b2ebbfc1754cedd922ba5f76ef0b94734df943446
{
    "success": "Successfully added dockerhost :: prometheus_alertmanager to the system. 
               Config imported but not yet applied."
}

Container Name: pedantic_ramanujan Id: fb93a1ed6945d329488ad348fc9ec5478daf31a67982ff21c6cbcc34f884a658
{
    "success": "Successfully added dockerhost :: pedantic_ramanujan to the system. 
               Config imported but not yet applied."
}

{
    "success": "Apply config command has been sent to the backend."
}

And finally, on Nagios XI this would look like below:

Nagios XI host and container service checks

This was obviously a simple proof of concept; the solution would need polishing to be used in production, and you would also need a lot more attributes collected to fully monitor Docker, but the ideas and the concepts remain the same… they just need to be extrapolated as required.

I will be exploring monitoring Docker with IBM Application Performance Management (APM) / IBM Tivoli Monitoring in more detail in the next installment of this blog, where I will look to gather all the performance metrics discussed in the previous sections for the host and containers, along with creating dashboards in those tools.

If you would like to discuss monitoring options for your Docker estate, feel free to contact me via email at hari.vittal@orb-data.com.
