Events, my dear boy, events

When Harold Macmillan became Prime Minister of the United Kingdom on 10th January 1957, he probably didn’t think that 54 years later he would appear in a blog about Enterprise Systems Management. However, one thing he said during his period of office is just as true in the world of IT, and specifically in monitoring projects, as it is in politics. When asked what represented his greatest challenge, Macmillan replied: ‘Events, my dear boy, events’.

So what are the issues?

I think there are 5 main issues with event management that can catch out many monitoring projects.

Issue 1 – Too Few Events

Many customers start a monitoring project with 2 main thoughts: firstly, that they must get a return on investment from their new tool, and secondly, a worry that not every application and operating system will be adequately monitored. They are concerned that something will go wrong and will not be picked up by the new tool.

Issue 2 – Too Many Events

This concern invariably leads to too many events, several of which result in no useful response. In fact they are worse than useless, as an operator has to spend time acknowledging or closing them without looking at the cause.

Issue 3 – Translating Incomprehensible Events

Imagine a scenario where, at 3am, an operator at your company receives the following event:

An operator without training will have no idea what to do with this event and will invariably resort to calling out team after team to find out whether the event is important or not. This can be extremely costly, as the man hours spent investigating an event like this mount up.

Issue 4 – Understanding the best way to resolve an event

Even when you get an event that is understandable, what do you do with it? More importantly, do 2 different people at your company both know the simplest and quickest way to resolve the issue? If not, time and money are wasted.

Issue 5 – Understanding how important an event is to your business

Lastly, when faced with a screen like the one shown below, which event do you deal with first?

 

When faced with a screen of red critical alerts, it is impossible to prioritise them and know which alert is most important to your business. Invariably the operator will start at the top and work down. However, if the first event is a CPU 100% busy alert and the second is an “Internet Banking Down” alert, this approach fails.

The Solution

So what is the solution? Start planning the events before a single agent is installed and follow these three simple rules:

1/ Decide what events you really need

Think about the action that will be taken when the event comes in. A large number of events can be filtered out by applying the rule that if there is no defined action then there is little value in sending the event. An example would be the ubiquitous CPU 100% busy alert. If you do not perform any action when these events arrive, then don’t send them. It is better to monitor the run queue or load average to see whether a system is genuinely busy before acting on a CPU busy alert.
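As a rough illustration of this rule, the sketch below forwards an event only when an action has been defined for it. The event types, host names and the `DEFINED_ACTIONS` table are hypothetical and are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass

# Hypothetical table: event type -> the action an operator takes on arrival.
# An event type with no row here has no defined action.
DEFINED_ACTIONS = {
    "run_queue_high": "Identify runaway processes and contact the application owner",
    "filesystem_full": "Clear temporary files, then raise an incident if still full",
}


@dataclass
class Event:
    event_type: str
    host: str


def should_forward(event: Event) -> bool:
    """Forward an event only if a defined action exists for its type."""
    return event.event_type in DEFINED_ACTIONS


events = [
    Event("cpu_busy_100", "web01"),     # no defined action, so it is dropped
    Event("run_queue_high", "web01"),
    Event("filesystem_full", "db02"),
]

forwarded = [e for e in events if should_forward(e)]
for e in forwarded:
    print(f"{e.host}: {e.event_type} -> {DEFINED_ACTIONS[e.event_type]}")
```

In practice the filter would normally be applied at the agent, so that unwanted events are never sent at all, rather than discarded after they arrive.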

2/ Decide how important the Event is

Define the impact on the business for each event that can come in. To do this you will need to ensure that each event identifies which application (or business service) is affected. This will also result in far better severity definitions.
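A minimal sketch of what this makes possible, assuming each event carries a `service` field: the operator's list can be ordered by business impact instead of arrival order. The service names and impact ranks below are invented for illustration.

```python
# Hypothetical business impact ranking: 1 is the most important service.
SERVICE_IMPACT = {
    "internet_banking": 1,
    "payroll": 2,
    "test_lab": 4,
}
DEFAULT_IMPACT = 5  # assumed rank for events with an unknown service


def prioritise(events):
    """Return events ordered by business impact, most important first."""
    return sorted(
        events,
        key=lambda e: SERVICE_IMPACT.get(e["service"], DEFAULT_IMPACT),
    )


# Events in the order they arrived on the operator's screen.
incoming = [
    {"message": "CPU 100% busy", "service": "test_lab"},
    {"message": "Internet Banking Down", "service": "internet_banking"},
]

ordered = prioritise(incoming)
for event in ordered:
    print(event["message"])
```

With this ordering the “Internet Banking Down” alert is handled before the CPU alert, even though it arrived second.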

3/ Document and integrate every alert into a knowledge base

Lastly, each event should have brief and easily accessible details of how to resolve the problem, or at least details of who to pass it to. We call this information a Message Catalogue; however, I have also seen the name Event Catalogue used.

There are 5 reasons why I would suggest creating a message catalogue:

  1. Firstly, it documents the alerts that a system or application is capable of generating during its operation.
  2. Secondly, it provides support teams with information that enables them to respond to incoming alerts in a timely, appropriate and consistent fashion.
  3. This empowerment of support staff reduces the mean time to recovery of incidents and off-loads some of the burden from second line teams.
  4. It helps in the skilling of otherwise untrained staff.
  5. And importantly, it ensures that service-affecting alerts are identified and dealt with first.

Orb Data uses a message catalogue based on DokuWiki technology (see below for an example), but you can use your own tools for this, such as Microsoft SharePoint. Whatever you choose, it is imperative that a well-defined and structured format is applied consistently across all entries.  The format for entries in the Orb Data Message Catalogue is as follows:

  1. Message Text – the alert as it will appear to the operator
  2. Severity – the severity of the alert
  3. Description – more detail about the alert including what has generated it, why it has been generated and what the implications are
  4. Source – The source of the event
  5. Business Impact – what impact the problem represented by the alert is likely to have on the immediate service and the business as a whole.  This information is of vital importance as it will help the operator to understand the significance of the alert and therefore the priority that should be attributed to it
  6. Required Action – how the operator should respond to this alert.  Typically includes what first line support actions should be taken and any cross references to the Known Error Database
  7. Escalation – how the operator should escalate this alert to the relevant support area, e.g. raise an incident, page on-call etc
  8. Owner – each entry in the Message Catalogue has a designated owner who is responsible for the content.  This is a key field as it provides a direct contact, should details regarding the message need to be clarified or updates suggested
  9. Version History – how this entry in the Message Catalogue has been changed over time
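To show how such a format might be enforced consistently, here is a small sketch that models the nine fields above as a Python dataclass and reports any that have been left empty. The class name, field names and sample values are assumptions made for illustration; they are not the actual Orb Data implementation.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class CatalogueEntry:
    """One Message Catalogue entry, mirroring the nine fields listed above."""
    message_text: str
    severity: str
    description: str
    source: str
    business_impact: str
    required_action: str
    escalation: str
    owner: str
    version_history: List[str] = field(default_factory=list)


def missing_fields(entry: CatalogueEntry) -> List[str]:
    """Return the names of fields that are still empty, in definition order."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


# Hypothetical entry with no owner or version history recorded yet.
entry = CatalogueEntry(
    message_text="Internet Banking Down",
    severity="Critical",
    description="The synthetic login transaction has failed three times in a row.",
    source="Web response time monitor",
    business_impact="Customers cannot log in to internet banking.",
    required_action="Check the web tier health page, then restart the login service.",
    escalation="Raise an incident and page the on-call web team.",
    owner="",
)

print(missing_fields(entry))
```

A check like this could run whenever an entry is saved, so that incomplete entries are flagged to their owner rather than discovered by an operator at 3am.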

We also ensure that all of the tools where events can be viewed are integrated with the Message Catalogue, which means tools like ITM 6 and the Netcool Tivoli Integrated Portal are both modified as shown below:

TIP

ITM 6

The last point about the Message Catalogue is that you must capture every event that does not yet have an entry. The aim should be that every such event gets an entry as soon as possible.
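One simple way to capture these gaps is sketched below, under the assumption that incoming message texts can be matched exactly against the catalogue. Real events often embed host names or values, so some normalisation would be needed first. All message texts here are made up.

```python
from collections import Counter

# Hypothetical set of message texts that already have a catalogue entry.
catalogue = {
    "Internet Banking Down",
    "Filesystem nearly full",
}


def uncatalogued(messages, known):
    """Count how often each message without a catalogue entry has arrived."""
    return Counter(m for m in messages if m not in known)


incoming = [
    "Internet Banking Down",
    "Queue depth exceeded",
    "Queue depth exceeded",
    "Backup job failed",
]

gaps = uncatalogued(incoming, catalogue)
# The most frequent gaps are listed first, so they can be documented first.
for message, count in gaps.most_common():
    print(f"{count}x {message}")
```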

If this all sounds too complicated or just too much work, we will be happy to help you define which events are needed for your business and integrate them into a Message Catalogue. We can also supply the Orb Data Message Catalogue on your own hardware or, alternatively, provide it as an appliance. This VMware appliance runs on a Linux operating system and can be plugged into your VMware estate and configured.

If you follow the simple rules I have defined, and create and ensure the use of a Message Catalogue, you will (as Harold Macmillan once said) have never had it so good.
