How do you visualise alerting in the context of business services?
This is not a new question, but it has always been a difficult one to solve, and with the shift to cloud-based Service Management solutions like ServiceNow the goal posts are moving again.
In this blog I will take a look at how you can bring data from your ServiceNow CMDB to your OMNIbus events and use the Dashboard Application Service Hub (DASH) topology view to visualise the service component hierarchy and the impact of those events.
The diagram below shows the data flow. We take data from the ServiceNow CMDB and store it in a local topology cache database. This cache will then be used for enriching events in OMNIbus, modelling component severities using Impact and passing the data to DASH through the Impact UI provider.
NOTE: There are Netcool and TCR licensing restrictions on how you connect and use data from external sources. Please confirm with IBM how this affects you to make sure you are compliant before starting on this kind of development.
Building the Topology Cache
The first stage is to get data out of ServiceNow into a format that we can consume easily from Netcool Impact. ServiceNow provides an extensive REST API, so it is fairly simple to build a script that drills through the data we need and loads it into a local database.
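To make the extraction pattern concrete, here is a minimal Python sketch of paging through the ServiceNow Table API (the standard `/api/now/table/{table}` endpoint with `sysparm_limit`/`sysparm_offset` paging). The `get_json` callable is an assumption of this sketch: in production it would wrap an authenticated HTTP GET against your instance URL, but injecting it keeps the paging logic testable. Our actual extraction script was written in Perl; this is illustrative only.

```python
from typing import Callable, Dict, Iterator, List


def fetch_table(get_json: Callable[[str, dict], dict],
                table: str,
                fields: List[str],
                page_size: int = 100) -> Iterator[Dict]:
    """Page through a ServiceNow Table API endpoint.

    `get_json(path, params)` is assumed to perform an authenticated GET
    against the instance and return the decoded JSON body.
    """
    offset = 0
    while True:
        payload = get_json(f"/api/now/table/{table}", {
            "sysparm_fields": ",".join(fields),  # only pull what we cache
            "sysparm_limit": page_size,
            "sysparm_offset": offset,
        })
        rows = payload.get("result", [])
        for row in rows:
            yield row
        if len(rows) < page_size:   # short page means we've read everything
            return
        offset += page_size
```

The same helper can then be pointed at `cmdb_ci` for configuration items and `cmdb_rel_ci` for the relationships between them.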
I should explain why we chose to use a local database to cache the ServiceNow CMDB data before processing it with Impact. Impact does have a Web Services data source adapter (DSA) that can query the ServiceNow REST API on demand; however, there were a number of reasons we chose an intermediary database:
- The CMDB data can be transformed and pre-processed to a more suitable format and structure
- We also wanted to use the CMDB data for TCR/Cognos reports
- This will reduce the volume of ServiceNow request traffic
- We anticipated a performance issue for queries to the cloud vs. within the data centre
- The CMDB data does not change very often so a cache is suitable
- This provides resilience against internet connectivity issues
As a result, we didn’t use the DSA for the ServiceNow connection. If you are interested in that approach then Neil Richards has written an excellent article Use Netcool to Leverage Service Now Discovery and CMDB Data that discusses how to do it efficiently.
The screenshot below is from a sample instance of ServiceNow showing configuration items and their relationships. This is the type of data we are querying. The extraction script performs the following tasks:
- Discover and drill down through all key business services
- Identify the primary business area, service, application and component for each server and cluster – this will be used for event enrichment and reporting
- Generate topology indices for the topology widget in DASH
- Assign sensible defaults for severity modelling, e.g. average, maximum, weighting
- Housekeep the data to account for renaming, change of use, change of structure or decommissioned systems
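The second task above, identifying the business area, service, application and component for each item, amounts to walking the parent-to-child relationships from each top-level business service and labelling items by their depth in the tree. A minimal sketch of that walk, assuming relationship rows with hypothetical `parent`/`child` keys as they might come back from `cmdb_rel_ci`:

```python
from collections import defaultdict
from typing import Dict, List


def build_lookup(rel_rows: List[Dict], roots: List[str]) -> Dict[str, Dict]:
    """Walk parent->child CMDB relationships depth-first from each
    top-level business service, labelling every item with its ancestry
    (business area, service, application, component, server)."""
    children = defaultdict(list)
    for rel in rel_rows:
        children[rel["parent"]].append(rel["child"])

    # One label per tree depth; real CMDBs are messier, so treat this
    # as the happy path.
    labels = ["business_area", "service", "application", "component", "server"]
    lookup: Dict[str, Dict] = {}

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]
        lookup[node] = dict(zip(labels, path))
        for child in children[node]:
            walk(child, path)

    for root in roots:
        walk(root, [])
    return lookup
```

Each row of the resulting lookup table is then written to the topology cache, which is what makes the later event enrichment a simple keyed read.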
I will discuss the REST API calls to ServiceNow in more detail in a follow-up blog.
The ServiceNow extraction script has done the hard work of identifying business area, service and other attributes for each configuration item. This is populated into a lookup table in our topology cache so that it becomes a fairly standard Netcool Impact task to enrich events with this information.
We added the following fields to the OMNIbus database for enrichment:
- CMDB Business Area
- CMDB Service
- CMDB Application
- CMDB Component
- CMDB Server
- CMDB Flag
The flag field acts as a control field that the Impact service and policies use to track processing state.
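In our solution the enrichment itself is an Impact policy reading the lookup table; the logic is simple enough to sketch in Python. The field and flag names here mirror the custom columns listed above, but the `"enriched"`/`"unmatched"` flag values and the `Node` match key are illustrative assumptions, not the exact values from our policies.

```python
from typing import Dict, Optional


def enrich_event(event: Dict, lookup: Dict[str, Dict]) -> Dict:
    """Copy CMDB attributes from the topology-cache lookup table onto an
    OMNIbus event, matching on the event's Node field."""
    row: Optional[Dict] = lookup.get(event.get("Node"))
    if row is None:
        event["CMDBFlag"] = "unmatched"   # leave for housekeeping/triage
        return event
    event["CMDBBusinessArea"] = row["business_area"]
    event["CMDBService"] = row["service"]
    event["CMDBApplication"] = row["application"]
    event["CMDBComponent"] = row["component"]
    event["CMDBServer"] = row["server"]
    event["CMDBFlag"] = "enriched"        # marks the event as processed
    return event
```

Keeping the matching key to a single indexed column is what keeps this step cheap at high event rates.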
The severity calculation is another task performed by Netcool Impact. The calculation looks at current OMNIbus events in real time and applies a severity to the associated CMDB component. The policy then moves up the topology tree, applying severity modelling at each level until it reaches the top-level business service. For those of you familiar with IBM Tivoli Business Service Manager, some of the concepts in this process are very similar.
An Impact service is set up to run the calculation policy on a regular schedule. We chose a 1-minute cycle, but you should choose the timing carefully, considering how busy the process is likely to be. With performance in mind, it is important that the calculation policy is optimised to recalculate nodes only when there is a change to dependent nodes lower down the tree.
It’s not good enough just to propagate worst case severities up the topology. We want to take into account the resiliency of load balancing or clustering, and to treat events differently depending on their specific impact to availability or performance. So we need to build flexibility into the modelling, and make it easy to tweak and configure without recoding policy logic. To achieve this we added a number of features to the design:
- Active and inactive handling for resilient resources
- A rules table to apply a severity multiplier to particular events
- A weighting for each child node in the tree
- Various severity models to choose for each node – including maximum, average and weighted average
The end result of the calculation is stored back to the Topology cache as a severity value ranging from 0 to 100 for each configuration item.
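The roll-up described above can be sketched as a small recursive function. The tree structure, model names and the active/inactive flag below are illustrative assumptions standing in for the Impact policy and the cache tables; the point is how weights, severity models and resilience handling combine into a single 0 to 100 figure per node.

```python
from typing import Dict, List, Tuple

# tree[node] -> {"model": "max" | "avg" | "weighted",
#                "children": [(child, weight, active), ...]}
Tree = Dict[str, Dict]


def roll_up(node: str, tree: Tree, base_severity: Dict[str, float]) -> float:
    """Compute a 0-100 severity for `node`, recursing down the topology.

    Leaf severities come from current events (`base_severity`); inactive
    children (e.g. the standby half of a cluster) are excluded so that
    resilient resources do not degrade the parent.
    """
    spec = tree.get(node)
    if spec is None or not spec["children"]:
        return base_severity.get(node, 0)

    vals: List[float] = []
    weights: List[float] = []
    for child, weight, active in spec["children"]:
        if not active:
            continue
        vals.append(roll_up(child, tree, base_severity))
        weights.append(weight)
    if not vals:
        return 0

    if spec["model"] == "max":
        return max(vals)
    if spec["model"] == "avg":
        return sum(vals) / len(vals)
    # weighted average
    return sum(v * w for v, w in zip(vals, weights)) / sum(weights)
```

The per-event severity multiplier from the rules table would be applied when deriving `base_severity` from the enriched events, before this roll-up runs.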
The dashboard is implemented in Dashboard Application Service Hub (DASH) which is part of IBM Jazz for Service Management and is also the host application for the OMNIbus 8.1 Web GUI.
DASH connects to one or more UI data providers to populate information into the dashboard pages. As we have the data in an external database there are typically two choices of data provider:
- Netcool Impact
- Tivoli Directory Integrator (TDI)
In this solution it was most convenient to use the Impact UI data provider, which is already connecting to and enriching the data in the topology cache. All that was needed was to develop a few additional Impact policies to serve up the data we need. I'll go into more detail on this in another blog article, but a couple of the main policies were:
getCMDBTopology – This retrieves the CMDB hierarchy of items for an entire Business Service. The key part here is to return the data in the correct format for the DASH topology widget. This means supplying a topology ID for each item, linking it to its parent ID and returning all the items in ID order. It's slightly awkward to generate these IDs using Impact Policy Language, so we actually generate them during the ServiceNow extraction in the Perl script.
The other data we need is the severity data and node status which is also generated separately by the severity calculation discussed earlier.
So, with all the heavy lifting done beforehand, this policy actually only needs to read out from the topology cache and return the data.
getCMDBApplications – This method returns a single layer of the topology, which is useful for showing in a list widget in the DASH pages.
Another useful widget we can use is the Event Viewer which is supplied by the DASH integration with OMNIbus. The enrichment we apply to the OMNIbus events includes CMDB Service and CMDB Application which makes it easy to create transient filters for the Event Viewer to show appropriate events on each Service page.
Taking the Solution Further
As I mentioned already, I'll be going into more of the technical details of this solution in follow-up blogs. For now, I'll finish with a few ideas on how this solution can be extended further.
Historical performance and availability statistics and reports. One of the requirements for this project was to be able to report the health of business services and applications on a monthly basis. We were able to deliver this rapidly by building a custom ITM agent to read the topology cache. ITM has a built-in capability for data warehousing, so we used it to automatically summarise historical severities and availability percentages over the month, and then used Tivoli Common Reporting to produce the reports.
Event Grouping. This is a feature of the OMNIbus Event Viewer that allows related events to be grouped together visually. The hierarchical nature of the enrichment data populated into each event – Service, Application, Component and so on – is perfect for this grouping. It allows operators to quickly visualise and review events that are affecting the same top-level business service or application.
Synthetic Events. When the Impact policy performs the severity calculation it checks whether an item’s severity has changed since the previous cycle. This means it is able to generate synthetic problem and resolution events for each node in the topology when health is degraded. This is useful for triggering other notifications, and also for archiving to the Netcool Reporter database for historical reports.
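The change-detection behind the synthetic events can be sketched by comparing this cycle's severities with the previous cycle's. The degradation threshold and event fields below are illustrative assumptions; the `Type` values follow the usual OMNIbus convention of 1 for a problem and 2 for a resolution.

```python
from typing import Dict, List


def synthetic_events(previous: Dict[str, float],
                     current: Dict[str, float],
                     threshold: float = 50) -> List[Dict]:
    """Emit synthetic problem/resolution events for nodes whose severity
    crossed the degradation threshold since the last calculation cycle."""
    events: List[Dict] = []
    for node, sev in current.items():
        prev = previous.get(node, 0)
        if sev >= threshold > prev:        # newly degraded
            events.append({"Node": node, "Type": 1,
                           "Summary": f"{node} health degraded ({sev})"})
        elif prev >= threshold > sev:      # recovered
            events.append({"Node": node, "Type": 2,
                           "Summary": f"{node} health restored ({sev})"})
    return events
```

Inserting the resulting events back into the ObjectServer means the standard OMNIbus tooling – notifications, archiving, reporting – picks them up with no special handling.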