Time Based Monitoring
Introduction
It is a common requirement that monitoring conditions are only evaluated during specific time periods, for example during business hours or outside of weekend maintenance periods. Time based monitoring minimises the time wasted by the Operations Team investigating false events/incidents, i.e. false positives, for example events generated for an inactive service at a time it is known not to be running.
There are several options for avoiding such false positives either through ITM or external solutions within the extended Tivoli portfolio:
- Time attributes within an ITM situation
- Embedded Time Based Situation
- Starting/stopping the situation via a Workflow Policy
- Dynamic Threshold Situations
- The Until option
- External solutions
This tip will focus on the ITM options for avoiding such false positives.
Time Attributes within a Situation
The most basic method for suppressing events is through the inclusion of time based attributes within the monitoring situation. In certain scenarios it is possible to include attributes from different attribute groups within a single situation. In such cases at least one of the attribute groups must be single-row; you cannot include two multi-row attribute groups within a single situation.
The example throughout this tip is implementing a basic Windows disk situation that only alerts from 09:00 to 16:59 on weekdays. For this first example, the basic formula in figure 1 is expanded…
Figure 1
…using the Add Condition option from the Situation Editor to include the time based attributes from the Universal Time attribute group, as highlighted in figure 2.
Figure 2
These attributes are added to the formula twice to enable a range to be defined for both attributes, as in figure 3.
Figure 3
Note that the time returned by this attribute group is UTC, so the values in the formula will need to be adjusted for the local time zone.
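The UTC adjustment can be worked through in a short sketch. The Python below is illustrative only and is not ITM formula syntax; the function names and the fixed UTC offset are assumptions made for this example, and a fixed offset ignores daylight saving changes.

```python
from datetime import datetime, timedelta, timezone


def local_window_to_utc(start_hour, end_hour, utc_offset_hours):
    """Shift an inclusive local hour range to the equivalent UTC hours."""
    return ((start_hour - utc_offset_hours) % 24,
            (end_hour - utc_offset_hours) % 24)


def in_business_hours(utc_now, utc_offset_hours):
    """True on local weekdays from 09:00 to 16:59 inclusive.

    utc_now must be a timezone-aware datetime expressed in UTC.
    """
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local = utc_now.astimezone(local_zone)
    # weekday(): Monday is 0, so values below 5 are Monday to Friday.
    return local.weekday() < 5 and 9 <= local.hour <= 16
```

For a site five hours behind UTC, `local_window_to_utc(9, 16, -5)` gives hours 14 to 21, which are the values that would go into the formula. Where the shifted range wraps past midnight, the day of the week in UTC also differs from the local day for part of the window, so the day condition would need the same care.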
The main disadvantage of including the time attributes directly within a situation is that you will need to duplicate effort when applying the same time restriction to multiple situations. If the time restriction subsequently changes you will need to edit multiple situations. Finally, from an internal ITM perspective, the SQL built for such a situation is relatively complex and hence inefficient.
Embedded Situations
Embedded situations can address some of the drawbacks of including the time attributes directly within the monitoring situation. For this method a dedicated time based situation is created, similar to the out-of-the-box PrimeShift and NonPrimeShift situations. The purely time based situation can then be embedded into multiple different monitoring situations. Similar multi-row attribute group restrictions apply.
In this example, the first step is to create a new situation associated with the All Managed Systems agent type, as demonstrated in figure 4.
Figure 4
The condition for the formula again uses the Universal Time attribute group; the formula is shown in figure 5. Set the sample interval to a value equal to or lower than that of the monitoring situation in which it will be embedded.
Figure 5
The new time based situation can then be embedded in the monitoring situation using the Situation Editor Add Conditions dialogue. Select the Situation Comparison radio button to list and highlight the required situation, as in figure 6. The user will be prompted to update the embedded situation’s distribution list to ensure it is evaluated on the same managed systems as the parent situation.
Figure 6
The only comparison permitted for an embedded situation is that the situation is true, as in figure 7.
Figure 7
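The benefit of embedding, one time definition shared by many monitoring situations, can be sketched in a few lines. This Python is a conceptual model rather than ITM syntax: the `embed` helper, the sample dictionary keys and the `cpu_high` check are all hypothetical names introduced for illustration, and the hour and weekday values are assumed to be already adjusted for the time zone.

```python
def business_hours(sample):
    """Stand-in for the dedicated time based situation (weekdays, 09:00 to 16:59)."""
    # Assumed convention: weekday 0 is Monday, so 0 to 4 are weekdays.
    return sample["weekday"] < 5 and 9 <= sample["hour"] <= 16


def embed(time_situation, condition):
    """Return a check that is true only when both parts are true."""
    return lambda sample: time_situation(sample) and condition(sample)


# Two monitoring checks reuse the same time based definition.
disk_low = embed(business_hours, lambda s: s["pct_free"] < 5)
cpu_high = embed(business_hours, lambda s: s["cpu_busy"] > 90)  # hypothetical second situation
```

If the business hours change, only `business_hours` needs editing and every check that embeds it picks up the change, which mirrors the maintenance advantage described above.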
Scheduled Situation Start/Stop
It is possible to start and stop a situation automatically via Workflow policies. The same time based situation can be utilised as described above. The situation should be distributed to the Hub TEMS. The example workflow below runs when the time based situation is TRUE and starts the disk monitoring situation. It then waits until the same time based situation is false and stops the disk monitoring situation. In this workflow the policy also writes messages to the Universal Message Console at key points.
Figure 8
Although this has the advantage that no processing cycles are used by the agent during the black-out period, it is more prone to issues and utilises more TEMS cycles in the processing.
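The control flow of such a policy can be modelled as a simple two-state loop. The following Python is a sketch of the logic only, not of the Workflow Editor or any ITM API; the function name and the idea of feeding it a list of sampled true/false values are assumptions for illustration.

```python
def policy_actions(time_situation_samples):
    """Return the start/stop actions triggered by successive samples.

    Each sample is True when the time based situation is true.
    """
    running = False
    actions = []
    for is_true in time_situation_samples:
        if is_true and not running:
            actions.append("start")   # window opens: start the monitoring situation
            running = True
        elif not is_true and running:
            actions.append("stop")    # window closes: stop the monitoring situation
            running = False
    return actions
```

Actions are emitted only on a change of state, so repeated true or false samples inside or outside the window do nothing further.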
Dynamic Threshold Situations
Dynamic threshold situations adapt the monitoring threshold based on defined conditions, for example managed systems, managed system group object (managed system list), attribute values or calendars. Through the use of calendars the monitoring threshold can be modified during the black-out period to a more appropriate level or a level that will never generate a situation event. Dynamic situations are the most flexible of the options as they are not limited by the standard situation formula size of 1020 characters and changes can be applied at a more granular level. Again there are restrictions that apply to dynamic thresholds, for example it is not possible to use the feature where aggregate functions, correlated situations or multiple attribute groups are used.
In this example the base logic has been inverted, i.e. the base situation condition monitors for the percentage free space to be less than zero. This condition should never occur, and hence will never generate a situation event. The first step is to define a suitable calendar entry using the command line interface command tacmd addcalendarentry, see figure 9.
Figure 9
The override formula dialogue is accessed from the distribution tab of the Situation Editor. Select a Managed System or Managed System Group Object and click the button Override Formula…. Figure 10 shows the Formula Override expression defined for the example, the updated threshold highlighted in blue.
Figure 10
This updated threshold is applied based on the selected schedule. Hence, during the defined calendar period, 0900-1700 weekdays, the threshold will be “overridden” from =<0% to =<5%. The dialog displayed in figure 11, used for selecting the schedule, is accessed from the calendar icon with the green check-mark, as displayed in figure 10.
Figure 11
Such overrides can be made specific to managed system group objects (Managed System Lists) or managed systems making it a very flexible option.
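The effect of the calendar override on the threshold can be illustrated with a small model. This is a hedged Python sketch, not ITM behaviour reproduced exactly: the names are invented for the example, the datetimes are assumed to be in the calendar's local time, and a strict less-than comparison is assumed so that the base threshold of 0 can never match; the operator in the actual formula may differ.

```python
from datetime import datetime

BASE_THRESHOLD = 0       # percent free; cannot match outside the calendar window
OVERRIDE_THRESHOLD = 5   # percent free; applied while the calendar entry is active


def in_calendar(now):
    """True on weekdays from 09:00 up to, but not including, 17:00."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def effective_threshold(now):
    """Pick the override inside the calendar window, otherwise the base value."""
    return OVERRIDE_THRESHOLD if in_calendar(now) else BASE_THRESHOLD


def situation_fires(pct_free, now):
    """Compare the sampled free space against whichever threshold applies."""
    return pct_free < effective_threshold(now)
```

With 3% free space the check is true at 10:00 on a Monday but false at 20:00 the same day, because outside the window the comparison falls back to the base threshold.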
Use of the “Until” Clause
The final purely ITM method for suppressing situation events during a specific time period is through the use of the “until” option within the Situation Editor. This enables a situation event to remain active until a second situation evaluates to true. In this scenario the second situation is a time based situation, and defines the black-out period.
Figure 12 demonstrates the time-based situation to define the blackout period. This situation is distributed to the same managed systems as the monitoring situation, but is not associated with a navigator item.
Figure 13 shows the Situation Editor “until” tab for the monitoring situation, with the time based situation selected.
This is an effective method that can be used for simple scenarios, being much more efficient than the option of including time based attributes within the situation, although it is not as flexible as the dynamic threshold option.
External suppression
The above techniques have looked at ITM methods for suppressing situation events during the black-out period. External techniques could also be used, for example using SOAP commands to stop the situations. It is also possible to use external event processing engines to suppress events, for example Netcool/OMNIbus. The latest version of Netcool/OMNIbus ships with a limited license for Netcool/Impact that includes a maintenance solution, a solution Orb Data can implement for you as part of the Netcool/Impact Fast Start service. A recurring maintenance period may be set up to suppress events matching specific criteria. This may be a preferable option depending on the complexity of the black-out windows and the varying types of condition to be suppressed.
Conclusions
As indicated, there are several different methods for suppressing situation events from within ITM. The method chosen will depend on the overall monitoring infrastructure, the complexity of the monitoring conditions and the complexity of the black-out periods. It is usually preferable to suppress as near to source as possible, i.e. at the situation condition level. However, in a more heterogeneous environment, with events generated by many different monitoring tools, it may make more sense to have a single, central solution that offers common suppression for all events.