Danshari – The age of minimalist monitoring

Danshari is a Japanese concept meaning to declutter. It is formed of 3 ideograms (断捨離), meaning “refuse”, “dispose” and “separate”.

I’m reading a book at the moment called Goodbye Things by Fumio Sasaki in which he shares his personal minimalist experience, offering specific tips on the minimizing process and revealing how the new minimalist movement can not only transform your space but truly enrich your life. Whilst I was reading I thought the same philosophy could be applied to Monitoring. I’ve written articles before where I have touched on this but now with the advent of some new tools I think it is worth discussing this again.

How can we apply minimalism to monitoring?

The point of any monitoring tool is to alert your company when some important technology is malfunctioning or will malfunction if the issue is not dealt with. It sounds simple, but I don’t see this being achieved that often. The issue is that companies often receive too many alerts, monitor too many things and frequently choose the wrong thresholds. The point of minimalist monitoring is to receive only events that matter to your business and receive them as quickly as possible whilst spending the least amount of time configuring the tools.

Using the Danshari philosophy let’s look at how this can be achieved.

Refuse

The worst issue that I see is too many events. If an event is received and the only action is to delete it, then it is worse than useless, as an operator must spend time acknowledging or closing it without ever looking at the cause. The solution is to think about the action that will be taken when the event comes in. Many events can be filtered by applying the rule that if there is no defined action then there is little value in sending the event. An example would be the ubiquitous CPU 100% busy alert. If you do not perform any action when these events arrive, then refuse to add them in the first place. In this particular case it is better to monitor metrics like the run queue or load average to see whether a system is genuinely overloaded before acting on a busy CPU.
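To make the "no defined action, no event" rule concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the Event fields, the RUNBOOK contents, and the thresholds in cpu_is_saturated are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # host or application that raised the event
    kind: str     # event type, e.g. "disk_full"
    message: str


# Hypothetical runbook: event kind -> the action an operator would take.
RUNBOOK = {
    "disk_full": "extend the volume or purge old logs",
    "service_down": "restart the service and page the owner",
}


def refuse_unactionable(events, runbook):
    """Keep only events whose kind has a defined action in the runbook."""
    return [e for e in events if e.kind in runbook]


def cpu_is_saturated(cpu_percent, load_average, cpu_count,
                     cpu_threshold=95.0, load_factor=1.0):
    """Treat a busy CPU as a problem only when work is also queueing.

    The thresholds are illustrative assumptions: the CPU must be at or
    above cpu_threshold AND the load average must exceed
    cpu_count * load_factor.
    """
    return (cpu_percent >= cpu_threshold
            and load_average > cpu_count * load_factor)
```

With this sketch, a "cpu_busy" event that has no runbook entry never reaches the operator, and a host at 100% CPU with a load average of 3.2 on 8 cores is not treated as saturated, whereas the same host with a load average of 12.5 is.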

Dispose

The second issue that causes too many alerts is incorrect thresholds. There are teams of people who spend their whole careers raising and then lowering thresholds. The reason for this is that the value you choose may be right for the day you set it, but the next day may bring a different work pattern and the threshold is once again incorrect. For example, take a credit card company’s payment authorisation system. This would be expected to be busier on a Saturday than on a Monday, whereas a company’s email system may be expected to be busier on a Monday at 9am as people come into work and catch up with emails that have been sent since they left on Friday. How do you set one threshold for these scenarios?

The answer is to use a product such as IBM’s Predictive Insights that learns the behaviour of all the metrics you want to monitor so that you are only alerted if the performance is outside the expected range. Those thresholds will be dynamic across the business day and week, resulting in fewer false alerts yet ensuring problems are not missed due to thresholds being set too high. It doesn’t replace your other monitoring tools, but it makes the manual setting of thresholds redundant and disposes of the events that would otherwise clutter the operator’s dashboard while representing nothing more than normal business behaviour.

It is also important to note that it is not just high spikes that should be detected: unusually low usage of a trading floor system at a busy time is just as much of a concern as an over-used and slow system.
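The idea of a threshold that moves with the business day and week can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not how Predictive Insights works internally: it assumes a baseline of mean and standard deviation per (weekday, hour) bucket and a hypothetical tolerance of k standard deviations, and it flags values that are too low as well as too high.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_baseline(samples, min_points=3):
    """Learn normal behaviour per (weekday, hour) bucket.

    samples is an iterable of (datetime, value) pairs. Buckets with
    fewer than min_points observations are skipped as unreliable.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        key: (mean(values), stdev(values))
        for key, values in buckets.items()
        if len(values) >= min_points
    }


def classify(baseline, ts, value, k=3.0):
    """Return 'high', 'low', 'normal', or 'unknown' for one observation."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return "unknown"          # nothing learnt for this time slot yet
    mu, sigma = baseline[key]
    if value > mu + k * sigma:
        return "high"
    if value < mu - k * sigma:
        return "low"              # unusually quiet is also a warning sign
    return "normal"
```

Fed with a few weeks of history in which Monday 9am traffic is around 1,000 requests and Saturday 9am traffic is around 100, the same reading of 1,000 is normal on a Monday and high on a Saturday, while a reading of 100 on a Monday morning is reported as low. No single static threshold could give all three answers.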

Separate

When you start any monitoring project you may have 2 main concerns: firstly, that you must get a return on investment from your new tool and, secondly, that every application and operating system must be adequately monitored. However, this often leads to monitoring too many metrics and consequently receiving too many alerts. The ideal solution would be to take multiple data feeds from your tool (or tools) and let an overarching product separate what is important from the rest of the noise.

Tools like IBM’s Predictive Insights learn relationships across technology boundaries. They analyse metrics and decide which ones have a direct relationship to each other and what is standard behaviour, no matter what tool is providing the data. For example, Predictive Insights may find a relationship between a web service response time and a database pool hit ratio even though the data may be coming from different sources. Alerts will be generated if there is abnormal behaviour within those learnt relationships and not on the individual metrics.
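As a rough illustration of alerting on a relationship rather than on an individual metric, the Python sketch below fits a straight line between two metrics and flags an observation only when the pair departs from that line. This is an assumption made for the sake of the example: a simple least-squares fit with a hypothetical tolerance of k residual standard deviations, not the algorithm Predictive Insights uses.

```python
from statistics import mean, pstdev


def learn_relationship(xs, ys):
    """Fit y ~ slope * x + intercept and record the residual spread."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has no variation, so no relationship can be learnt")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return slope, intercept, pstdev(residuals)


def breaks_relationship(model, x, y, k=3.0):
    """True when (x, y) is further from the learnt line than k spreads."""
    slope, intercept, spread = model
    return abs(y - (slope * x + intercept)) > k * spread
```

Suppose x is the database pool hit ratio and y is the web service response time in milliseconds. A response time of 240 ms has been seen before, and so has a hit ratio of 0.97, so neither value would trip a per-metric threshold. Seen together, however, they contradict the learnt relationship, and that is the condition worth alerting on.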

The combination of more appropriate thresholds and monitoring of metric relationships means that application and infrastructure problems are detected earlier than with traditional monitoring and with zero configuration effort from the ESM Team and the subject matter experts. This early warning enables problems to be resolved before they cause an incident.

Analytics can also help reduce the number of events by consolidating alerts across local, cloud and hybrid environments into a small number of actionable problems. Products like IBM’s Netcool Operations Insight apply analytics to historical events to drive efficiency in operations and identify problems faster with richer context. For example, Seasonality Analytics or seasonal event identification allows you to detect events that recur regularly at a particular “hour of day”, “day of week” or “day of month”. This in turn helps you recognise when a large number of alerts is caused by events like backups, maintenance or just busy periods, and adjust your maintenance and monitoring accordingly.
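Seasonal event identification can be approximated with nothing more than counting. The Python sketch below is a simplified illustration rather than the Netcool Operations Insight implementation: it assumes you have the historical timestamps for one type of event, and the min_events and min_share parameters are hypothetical tuning values.

```python
from collections import Counter

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def seasonal_buckets(timestamps, min_events=4, min_share=0.8):
    """Find the time buckets in which one event type is concentrated.

    timestamps is a list of datetime objects for a single event type.
    A view (hour of day, day of week, day of month) is reported when
    at least min_share of the occurrences fall into its busiest bucket.
    """
    total = len(timestamps)
    if total < min_events:
        return {}                 # too little history to call it seasonal
    views = {
        "hour_of_day": Counter(ts.hour for ts in timestamps),
        "day_of_week": Counter(DAY_NAMES[ts.weekday()] for ts in timestamps),
        "day_of_month": Counter(ts.day for ts in timestamps),
    }
    found = {}
    for name, counts in views.items():
        bucket, hits = counts.most_common(1)[0]
        if hits / total >= min_share:
            found[name] = bucket
    return found
```

An alert that fires at 02:00 every night would come back as seasonal by hour of day, which points towards a nightly backup or batch job, while one that fires only on Sundays would come back as seasonal by day of week, which suggests a weekly maintenance window.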

How should you start your journey to minimalism?

If your company is one of the many that suffer from too many alerts or spend too much time setting up monitoring tools, then it is time to start your journey to Danshari. If you would like to discuss how we can audit your current monitoring environment (regardless of the tools you use) and help reduce the events to a more manageable number, then please contact me at simon.barnes@orb-data.com
