Last updated on March 16th 2011 by Simon Barnes
As a consultant, the most common question I get asked at the start of any monitoring project is “What should I monitor?” Of course, I can advise on the best metrics for such things as DB2 or Solaris. However, for any monitoring solution to really achieve the ROI that it was bought for, you need to ask a very different set of questions, namely: “What are the KPIs for the solution you want to monitor, and how can they be measured?”
Or put more simply:
- “What is happening within the infrastructure?”
- “How does this relate to the business service, e.g. Internet Banking?”
So let’s start at the beginning with a definition of what KPIs actually are:
KPIs (Key Performance Indicators) are Operational, Line of Business, and financial metrics that reflect the strategic performance of an organization. Examples include banking transactions, processed orders, failed transactions and transaction response time.
As these examples show, a KPI should be business focussed and objective, and the KPIs will probably be different for each service you are planning to monitor. They will also probably break down into Performance KPIs and Availability KPIs.
For example, Equities Trading KPIs might be:
- Transactions completed online (Performance)
- Transactions passed to trading floor (Performance)
- Online trading application performance (Performance)
- Online trading application availability (Availability)
Whereas a bank may use the following KPIs:
- Teller, ATM, Retail Banking Transactions completed (Performance)
- Avg. Response Time by Transaction Type (Performance)
- Failed Transactions (Availability)
- Revenue from transactions (Performance)
- Operational Penalty for application downtime and severe performance degradation (Performance)
Measuring this sort of information ensures that IT is a business tool and not a business burden.
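As a rough illustration of how some of the banking KPIs above could be derived from raw transaction data, here is a minimal Python sketch. The `Transaction` record, its field names and the sample values are all hypothetical; what your monitoring tool actually exposes will differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Transaction:
    # Hypothetical record shape; real field names depend on your monitoring tool.
    kind: str            # e.g. "teller", "atm", "retail"
    response_ms: float   # end-to-end response time in milliseconds
    succeeded: bool      # True if the transaction completed


def bank_kpis(transactions):
    """Summarise completed, failed and average response time per transaction type."""
    completed = defaultdict(int)
    failed = defaultdict(int)
    timings = defaultdict(list)
    for tx in transactions:
        if tx.succeeded:
            completed[tx.kind] += 1
            # Only completed transactions feed the response time average here.
            timings[tx.kind].append(tx.response_ms)
        else:
            failed[tx.kind] += 1
    return {
        "completed": dict(completed),
        "failed": dict(failed),
        "avg_response_ms": {kind: mean(values) for kind, values in timings.items()},
    }


# Made-up sample data, purely for illustration.
sample = [
    Transaction("atm", 120.0, True),
    Transaction("atm", 180.0, True),
    Transaction("atm", 950.0, False),
    Transaction("teller", 300.0, True),
]
kpis = bank_kpis(sample)
```

Whether failed transactions should count towards the average response time is a design choice; this sketch leaves them out so that a run of timeouts does not distort the performance figure.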
At this point I can hear the system administrators and DBAs throwing their hands in the air and complaining that such monitoring is all very well, but it doesn’t help with the running of IT. Firstly, this is of course nonsense, as it is the business that pays for the IT in the first place. Secondly, KPIs are not restricted to business services. They can (and should) also be applied by administrators to the systems they run. For example, a Microsoft Exchange administrator may choose the following as KPIs:
- Size of email processed by server and region
- Internal and external Messages transferred
- Average internal & external email transfer times
- Failed transfers
Similarly, a UNIX/Windows administrator who is running a virtualized environment could choose:
- LPAR and Virtual Machine Utilization
- Physical server or mainframe utilization
- Efficiency achieved through virtualization
- Virtual Instance and physical device availability
There are, of course, hundreds more examples I could give, but I think you are probably getting the idea.
One thing you must always bear in mind when thinking of the KPIs is that they must be measurable using the tools that you have. A KPI that you cannot measure is not worth having.
If you have SLAs in place, these can be used to guide which KPIs need to be measured, as an SLA will contain a list of items that are important to your customers (internal or external) and may carry financial penalties as well.
Some KPIs measure things like percentage improvement. For example, a KPI for an Incident Management initiative could be “Percentage increase in the Incidents resolved by first line operatives.” To measure this you will need a baseline of performance, so if you do not have one already you will need to give yourself time to gather it before the KPIs can be accurately measured.
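To make the role of the baseline concrete, here is a small Python sketch of the arithmetic behind a percentage-increase KPI. The function name and the figures are invented for illustration; the point is simply that the calculation cannot be done without a baseline value.

```python
def percentage_increase(baseline, current):
    """Return the percentage change of `current` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("a non-zero baseline is needed to compute a percentage change")
    return (current - baseline) / baseline * 100.0


# Hypothetical figures: share of incidents resolved by first line operatives.
baseline_rate = 40.0   # percent, gathered during the baseline period
current_rate = 50.0    # percent, measured after the initiative started
improvement = percentage_increase(baseline_rate, current_rate)
```

With these made-up numbers the KPI would report a 25% increase in incidents resolved at first line.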
Before I finish, I want to leave you with a quick note about KQIs. While KPIs are Operational, Line of Business, and financial metrics, KQIs (Key Quality Indicators) are relative measures of various KPIs, or aggregations of various KPIs, that relate business performance to application performance. Some examples are banking transaction revenue while slow response times are being reported, and failed transactions due to the application being unavailable. KQIs can only be measured once the underlying KPIs are in place, so they should be considered secondary.
So am I saying don’t measure disk I/O, memory, etc.? No, I am not, but these should be seen as facilitators to measuring KPIs and not the other way round. Make KPIs the priority when you start the project.
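As one possible reading of the first KQI example, the Python sketch below combines two KPIs (response time and revenue per transaction) into a single figure: the revenue earned while response times were slow. The function name, the 2000 ms threshold and the sample data are assumptions made for illustration only.

```python
def revenue_during_slow_response(transactions, slow_threshold_ms):
    """Sum revenue from transactions whose response time exceeded the threshold."""
    return sum(
        revenue
        for response_ms, revenue in transactions
        if response_ms > slow_threshold_ms
    )


# Hypothetical (response_ms, revenue) pairs taken from two existing KPIs.
sample_tx = [(120.0, 15.0), (2400.0, 12.5), (3100.0, 7.5), (200.0, 5.0)]

slow_revenue = revenue_during_slow_response(sample_tx, slow_threshold_ms=2000.0)
total_revenue = sum(revenue for _, revenue in sample_tx)
# Express the KQI as a share of all revenue so it can be tracked over time.
slow_share = slow_revenue / total_revenue * 100.0
```

In this invented sample, half of the revenue was earned while the application was responding slowly, which is the kind of figure that ties application performance back to business performance.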