Predictive Monitoring – a worked example

I’ve blogged about predictive monitoring technologies before, but I think there is now evidence that these tools are becoming part of an IT team’s core toolset. The big change for me is that large software vendors such as IBM are now taking the tools seriously enough to start integrating them with their own monitoring and Manager of Managers tools. IBM has even gone as far as to integrate Predictive Insights into its SaaS solution, Service Engage, and it is this product that I plan to show in this blog.

For those of you used to some of the older IBM products, the new Application Performance Monitoring dashboard might come as a pleasant surprise. This is the same GUI that is available in IBM Tivoli Monitoring 8.1, which has just been released.

As you can see below, you can now categorise your alerts by application or service, and within each container you can easily see what has caused the alert.

DashBoard

If we look closer at a service, Credit Card Processing, you can see that it has 2 icons: Components and Events. Each icon has a small symbol in the bottom right indicating the highest level of alert that exists in that component. In this case both icons are at Warning (a yellow triangle with an exclamation mark). Looking closer at the Events icon, we can also see it has a red diamond symbol, which indicates that this particular item has a Predictive Insights alert associated with it.

ServiceView
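The roll-up behaviour of the icons can be sketched in a few lines of Python. This is only an illustration of the idea, not the dashboard's actual implementation: the `Severity` levels other than Warning, the `Event` class and the `icon_state` function are all names I have made up for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; only "Warning" is named in this post.
    NORMAL = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Event:
    # Hypothetical event record used only for this illustration.
    summary: str
    severity: Severity
    predictive: bool = False  # True for a predictive (anomaly) alert


def icon_state(events):
    """Return (highest severity, whether any predictive alert is present)."""
    worst = max((e.severity for e in events), default=Severity.NORMAL)
    has_predictive = any(e.predictive for e in events)
    return worst, has_predictive


events = [
    Event("memory usage high", Severity.WARNING),
    Event("response time anomaly", Severity.WARNING, predictive=True),
]
worst, has_predictive = icon_state(events)
print(worst.name, has_predictive)
```

The icon shows the worst severity it contains, and the diamond marker is simply a second flag that is set when any of the events is a predictive one.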

Now if we click on the Events icon we see the event view. This is similar to the Situation Event Console in ITM 6.x and shows the events that have caused the icon to change its state.

Events

In this case there have been 3 events: 1 UNIX alert relating to system memory, 1 WebSphere Response Time alert and lastly a Predictive Insights Anomaly alert, shown once again by the small red diamond. If we click on this alert we can see the alert details (as shown below).

EventDetails

The event details pane shows more information about the event and, next to the Severity, has a link called “View anomaly analysis” which launches Predictive Insights.

Anomaly

If we click on this link we can see a graph showing the response time (in black) of the Credit Card application. You can also see a green band going across the graph which shows the baseline (or expected trajectory) of the metric. And lastly you will notice 3 pink vertical lines which indicate the anomalies that have been detected. Let’s look closer at the latest anomaly.

AnomalyExpand

By clicking on the graph and highlighting the affected area we can expand the graph to see the issue in more detail. You can clearly see that the response time has diverged significantly from its expected path, causing the alert.
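To make the idea of a baseline band more concrete, here is a minimal Python sketch. Predictive Insights' own algorithms are not described in this post, so this is not its method: I am simply assuming a band of the mean plus or minus three standard deviations, learned from the previous 12 samples. The function names, the window size and the response times are all invented for the example.

```python
from statistics import mean, stdev


def baseline_band(history, width=3.0):
    """Return (low, high) bounds of the expected range for a metric."""
    centre = mean(history)
    spread = stdev(history)
    return centre - width * spread, centre + width * spread


def find_anomalies(samples, window=12, width=3.0):
    """Return indices of samples that fall outside the band learned
    from the `window` samples immediately before them."""
    anomalies = []
    for i in range(window, len(samples)):
        low, high = baseline_band(samples[i - window:i], width)
        if not low <= samples[i] <= high:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one obvious spike.
response_time_ms = [200, 205, 198, 202, 207, 199, 203, 201,
                    204, 200, 206, 202, 203, 480, 201]
print(find_anomalies(response_time_ms))  # → [13]
```

Notice that nobody had to decide in advance that, say, 400 ms is "too slow". The band is learned from the data itself, which is the point of the baseline shown on the graph.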

Another possible cause of an alert in Predictive Insights is metrics changing in relation to each other over a period of time. This method is called multivariate analysis: Predictive Insights analyses a number of variables together and detects when one is moving away from the others in an abnormal way. This provides greater accuracy than analysing a single metric, because if all the metrics move together in their usual way the system may just be busier than normal and not actually failing. An example of this is shown in the graph below, where not only has responseTime increased but PoolHitRatio has decreased and BlkWrtnPerSec has gone up. This variation of multiple metrics from their normal paths shows there is likely to be a significant issue.

MultiVariant

To show that PoolHitRatio has in fact decreased we can press the Baseline button next to the metric to switch the green baseline band to this metric.

Metrics

Once we have done this we can clearly see that PoolHitRatio has plummeted at the same time as responseTime has increased.

Baseline
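The difference between "busier than normal" and "actually failing" can be illustrated with a small Python sketch. Again, this is not how Predictive Insights does it. I am assuming a simple straight-line relationship between response time and a hypothetical load metric, `requests_per_sec`, which does not appear in the demo, and I flag a sample only when response time is far from what the load would predict. All names and numbers are invented.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Least-squares fit of ys ≈ slope * xs + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_limit(xs, ys, slope, intercept, width=3.0):
    """Largest deviation from the fitted line still treated as normal."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return width * stdev(residuals)


def is_anomalous(x, y, slope, intercept, limit):
    """True when y is further from the fitted line than the limit allows."""
    return abs(y - (slope * x + intercept)) > limit


# Hypothetical history: response time rises steadily with load.
requests_per_sec = [10, 20, 30, 40, 50, 60]
response_time_ms = [105, 198, 304, 399, 502, 597]

slope, intercept = fit_line(requests_per_sec, response_time_ms)
limit = residual_limit(requests_per_sec, response_time_ms, slope, intercept)

# High load and high response time: consistent with the learned relationship.
busy = is_anomalous(90, 895, slope, intercept, limit)
# Normal load but high response time: the metrics have moved apart.
failing = is_anomalous(30, 900, slope, intercept, limit)
print(busy, failing)
```

A single-metric check would raise an alert on both samples because the response time is high in each. Looking at the two metrics together, only the second one stands out.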

Hopefully this has given you some idea of how predictive monitoring can work in practice. It can give you the following benefits:

  • Proactively identify problems before they become service impacting
  • Eliminate the costly problem of setting and maintaining static thresholds
  • Speed up the process of identifying the root cause

If you would like to know more, or would like a more detailed discussion about the product, send me an email and I will be glad to help. Alternatively, if you have a service that you know has issues and would like Predictive Insights installed in your environment, we can work with you to scope a Proof of Concept.

If you want to try this demo for yourself you can access the IBM Service Engage demo at this page here.
