During day-to-day operations, most IT environments are subject to some degree of planned change or managed outage. If effective monitoring is in place, this type of activity is highly likely to generate “false-positive” alerts that have little or no value to the business whilst the work is ongoing. As a result, there is often a requirement for a solution that either modifies the monitoring criteria or changes how the resulting alerts are managed, so that normal alert response processes are not initiated.
IBM Tivoli Monitoring
ITM itself has no native maintenance mode capability, but one can be approximated to some degree through standard product functionality. Situations can be tailored to cater for repeating maintenance windows, such as an application being taken offline each night between 1am and 3am, through the use of day/time attribute comparisons directly in the situation definition, embedded situations or the Until feature. The use of the latter is explained in the technote http://www-01.ibm.com/support/docview.wss?uid=swg21393406.
Catering for less frequent or unpredictable periods of maintenance is a little more challenging. The most rudimentary approach is to disable the monitoring by shutting down the relevant monitoring agent(s). Of course, this will impact all the monitoring undertaken by the agent regardless of whether it is related to the maintenance work or not. Furthermore, shutting down the agent(s) will leave gaps in any historical data collection being performed. Another method is to stop the relevant situations during the maintenance window, or to modify the applicable situation distribution lists to remove those systems undergoing maintenance. The tacmd command interface provides a means of scripting these actions, which makes some sort of automated process feasible. However, the scope of situation distributions, particularly when Managed System Lists are used, can again result in non-impacted systems being left exposed.
A number of customers that we work with make use of the situation Action/System Command function to generate an email when a problem is detected. If the suppression of the emails during a known maintenance period is sufficient, extending the notification scripts to include support for maintenance processing is a logical progression. In fact, this is something we have just done for one of our remote support customers, giving them the ability to define a global maintenance window during which no emails will be sent. This solution could easily be developed further using a database to store maintenance window details and allowing for specific managed systems and/or situations to be suppressed.
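The global maintenance check described above can be sketched in a few lines. This is a minimal Python illustration, not the script we deployed: the function names `in_window` and `notify`, the half-open window convention and the `send` callback are all assumptions made for the example.

```python
from datetime import datetime
from typing import Callable, Optional, Tuple

Window = Tuple[datetime, datetime]  # hypothetical (start, end) pair


def in_window(now: datetime, start: datetime, end: datetime) -> bool:
    """Return True when `now` falls inside the half-open window [start, end)."""
    return start <= now < end


def notify(subject: str, now: datetime, window: Optional[Window],
           send: Callable[[str], None]) -> bool:
    """Call send(subject) unless the global maintenance window is active.

    Returns True if the notification was sent, False if it was suppressed.
    """
    if window is not None and in_window(now, *window):
        return False  # maintenance in progress: stay quiet
    send(subject)
    return True


if __name__ == "__main__":
    window = (datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 3, 0))
    # Inside the window, so nothing is printed.
    notify("Disk space low on db01", datetime(2024, 1, 1, 2, 0), window, print)
    # After the window has closed, so the subject is printed.
    notify("Disk space low on db01", datetime(2024, 1, 1, 4, 0), window, print)
```

Passing the sender in as a callback keeps the window logic separate from the mail transport, so the same check could sit in front of whatever mail command the situation action already calls.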
Finally, a third-party solution for implementing maintenance functionality in ITM exists in the form of the Raygun Software ZapMaintenance Agent, which is available through our Agent Store. This solution uses ITM situations to capture maintenance window definitions, identifying agents, situations or Managed System Lists that should be placed in maintenance for a specific period of time. These special situations are processed by the ZapMaintenance ITM agent, which uses the information captured to stop the situations running on the relevant agents.
Screenshot showing a ZapMaintenance situation maintenance window definition
Environments that use the Tivoli Netcool/OMNIbus product have a number of options available for implementing a maintenance capability.
Using probe rules, incoming alerts can be assessed to see if an active maintenance window is applicable. The lookup table functionality is particularly useful for this purpose, with entries being keyed off fields such as Node, AlertKey or AlertGroup. If an alert is determined to result from ongoing maintenance, this can be indicated by setting a value in a field such as SuppressEscl. Downstream automations and alert displays can then reference this field to determine if and how they should interact with the alert.
Maintaining the contents of the lookup files across a distributed installation does present a challenge, particularly where a large number of probes have been deployed. As an alternative, Netcool/OMNIbus automation provides a means of centralising the maintenance processing and is an approach we adopted when implementing a maintenance solution for one of our other remote support customers. This solution is based around a custom table that is used to store maintenance records. Each record defines values for the Node, AlertGroup and AlertKey fields that are used to determine if an incoming alert is considered to be within maintenance. A start and end time for the maintenance window is also defined. A trigger is used to compare each new alert against the records in the maintenance table. If a matching record is found, the SuppressEscl field is set, which allows the alert to be ignored by other automations, such as those used to initiate the notification process.
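To make the matching logic concrete, here is a rough Python model of what such a trigger does. It is only a sketch: the real automation would be written in ObjectServer SQL, and the `MaintenanceRecord` class, the use of `None` as a "match anything" value and the `SUPPRESS_MAINTENANCE` value of 6 are assumptions for illustration rather than details of the customer solution.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

SUPPRESS_MAINTENANCE = 6  # assumed SuppressEscl value meaning "maintenance"


@dataclass
class MaintenanceRecord:
    """One row of the hypothetical custom maintenance table."""
    node: Optional[str]         # None is treated as "any Node" (assumption)
    alert_group: Optional[str]  # None is treated as "any AlertGroup"
    alert_key: Optional[str]    # None is treated as "any AlertKey"
    start: int                  # window start, epoch seconds
    end: int                    # window end, epoch seconds (exclusive)


def matches(record: MaintenanceRecord, alert: Dict, now: int) -> bool:
    """True when the window is active and every defined field equals the alert's."""
    if not (record.start <= now < record.end):
        return False
    pairs = (
        (record.node, alert["Node"]),
        (record.alert_group, alert["AlertGroup"]),
        (record.alert_key, alert["AlertKey"]),
    )
    return all(want is None or want == got for want, got in pairs)


def apply_maintenance(alert: Dict, records: Iterable[MaintenanceRecord],
                      now: int) -> Dict:
    """Flag the alert via SuppressEscl if any maintenance record matches it."""
    if any(matches(r, alert, now) for r in records):
        alert["SuppressEscl"] = SUPPRESS_MAINTENANCE
    return alert
```

Because the alert is only flagged rather than deleted, it remains in the event list for anyone who wants to see it, while the notification automations can simply skip rows carrying the maintenance value.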
A product in the Tivoli Netcool suite well suited to maintenance processing is Netcool/Impact, which is now shipped as part of Netcool/OMNIbus under a limited-use license. The combination of the ability to reference external data sources such as databases and the rich functionality offered by the policy scripting languages provides a means for developing sophisticated solutions.
Orb Data has written a bespoke Impact-based maintenance solution called Maintenance Manager. This works within the constraints of the limited-use license by utilising the internal HSQL database to store the maintenance records. The maintenance window definitions are similar to those used in the native Netcool/OMNIbus solution described earlier and consist of fields used to match incoming alerts to a particular window. For example, a maintenance window can be configured to apply to all alerts issued from a specific Node. The functionality offered by Impact means that the matching performed can be more flexible, such as supporting the use of regular expressions. This additional functionality extends to the definition of the time element of the maintenance windows. As well as fixed start and end times, repeating windows are supported, based on day of the week, date of the month or ordinal day of the month (e.g. last Friday or first Monday in the month), as are time zones. Impact’s Operator Views are used to provide the user interface and facilitate the review, addition and deletion of the maintenance records held in the database.
Screenshot showing a number of defined maintenance windows
Screenshot showing the addition of a new fixed window maintenance record
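Two of the more flexible features mentioned above, regular expression matching and ordinal-day windows such as "last Friday of the month", are easy to illustrate. The Python below is a hedged sketch of the ideas only; Maintenance Manager itself is implemented as Impact policies, and the helper names `nth_weekday` and `node_matches` are invented for this example.

```python
import calendar
import re
from datetime import date


def nth_weekday(year: int, month: int, weekday: int, n: int) -> date:
    """Return the n-th given weekday of a month (weekday: 0=Monday .. 6=Sunday).

    n counts from 1 for first, second and so on; use n=-1 for the last
    occurrence. Raises IndexError if the month has no such occurrence.
    """
    days = [
        d for d in calendar.Calendar().itermonthdates(year, month)
        if d.month == month and d.weekday() == weekday
    ]
    return days[n] if n < 0 else days[n - 1]


def node_matches(pattern: str, node: str) -> bool:
    """True when the whole Node value matches the regular expression."""
    return re.fullmatch(pattern, node) is not None


if __name__ == "__main__":
    print(nth_weekday(2024, 7, calendar.MONDAY, 1))   # first Monday of July 2024
    print(nth_weekday(2024, 7, calendar.FRIDAY, -1))  # last Friday of July 2024
    print(node_matches(r"web\d+", "web12"))
```

Matching the whole Node value, rather than searching within it, is a deliberate choice in this sketch: it avoids a pattern such as `web\d+` accidentally placing `oldweb1-backup` into maintenance.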
One other neat feature of our implementation is the ability to control whether affected alerts should be discarded completely or just temporarily suppressed, the latter resulting in alerts that will re-surface once the maintenance window they were associated with expires. This is something that has particular relevance when dealing with sampled ITM situations.
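The difference between the two behaviours can be sketched as follows. This is an illustrative Python model under our own assumptions: the `Mode` enum, the `suppressed` flag and both function names are hypothetical and do not reflect how the Impact policies are actually written.

```python
from enum import Enum
from typing import Dict, List, Optional


class Mode(Enum):
    DISCARD = "discard"    # drop the alert permanently
    SUPPRESS = "suppress"  # hide the alert until the window expires


def handle(alert: Dict, mode: Mode) -> Optional[Dict]:
    """Apply the window's mode to an alert that matched an active window."""
    if mode is Mode.DISCARD:
        return None  # nothing is kept, so nothing can re-surface later
    held = dict(alert)
    held["suppressed"] = True  # hypothetical flag marking a held alert
    return held


def release_expired(alerts: List[Dict], window_end: int, now: int) -> List[Dict]:
    """Un-suppress held alerts once the window has ended; return visible alerts."""
    if now >= window_end:
        for alert in alerts:
            alert["suppressed"] = False
    return [a for a in alerts if not a["suppressed"]]
```

With suppression, a problem that is still outstanding when the window closes becomes visible again without the source having to re-send it.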
In conclusion, for environments that have Netcool/OMNIbus, Netcool/Impact-based solutions stand out as the most flexible and functional. Implementing maintenance in ITM-only deployments is a bit more challenging but quite feasible nevertheless.