Historical Data Collection Restriction and Alerting
The collection of historical data from distributed ITM agents provides valuable information for capacity planning, trend analysis and debugging of problems. However, when an agent is unable to connect to the Warehouse Proxy Agent, the stability of the managed server can be compromised.
The Historical Data Collection solution implemented by ITM has three key phases:
- The collection and storage of the data to “short-term history” files, generally located on the managed system (agent) itself
- The upload of that historical information to the Data Warehouse, pushed from the agent via the Warehouse Proxy Agent
- Deletion of the uploaded information from the short-term history file
Each phase must complete successfully before the next phase runs. Hence, if the agent cannot communicate with the Warehouse Proxy Agent, for example due to a network outage, data is not deleted from the short-term history files and the files may grow without limit. This in turn may compromise the stability of the managed system through a lack of free space on the filesystem.
A new feature was added in ITM 6.2 FP01 and ITM 6.1 FP07 to prevent such conditions from compromising the stability of managed systems. The feature enables a size threshold to be defined for the short-term history files. Data collection is stopped if the combined size of the short-term history files exceeds the defined threshold. Although this results in incomplete data within the data warehouse, the stability of the managed system is not compromised by a lack of free space on the filesystem.
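The threshold check described above can be illustrated with a minimal Python sketch. It is not ITM's implementation: the function names, and the idea of passing an explicit list of history file names, are assumptions made purely for illustration.

```python
import os


def combined_size_bytes(directory, filenames):
    """Sum the sizes of the named files in directory, skipping missing ones."""
    total = 0
    for name in filenames:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            total += os.path.getsize(path)
    return total


def collection_allowed(directory, filenames, max_size_mb):
    """Return True while the combined size does not exceed the threshold.

    max_size_mb plays the role of the size threshold in megabytes;
    collection is treated as stopped once the combined size exceeds it.
    """
    limit_bytes = max_size_mb * 1024 * 1024
    return combined_size_bytes(directory, filenames) <= limit_bytes
```

A caller would evaluate `collection_allowed` periodically and stop or resume writing based on the result, which mirrors the stop and restart behaviour the feature provides.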
This article discusses how to configure the agents to prevent unrestricted growth of the short-term history files, and how to minimise the length of any resultant outages through self-monitoring.
Restricting the Short-term History Files
The solution to prevent the unrestricted growth of the short-term history files is based on two environment variables:
- KHD_TOTAL_HIST_MAXSIZE: The maximum size threshold, in megabytes, for the short-term history files. This threshold is based on the combined size of the short-term history files for all agents running on the managed system.
- KHD_HISTSIZE_EVAL_INTERVAL: The interval, in seconds, at which the threshold is checked. This has a minimum value of 60 seconds, and a default of 900 seconds (15 minutes).
As these variables must be defined for each ITM agent collecting historical data on a given managed system, it is recommended they are defined in a generic configuration file and that file is referenced from each of the agent specific configuration files. This method minimises the required administration where multiple agents are collecting historical data on a single managed system.
For example, on a Windows system collecting historical data from the Windows OS Agent and a Universal Agent Application, firstly, create a generic configuration file named %CANDLEHOME%\TMAITM6\GENENV with the following content:
KHD_TOTAL_HIST_MAXSIZE=4
KHD_HISTSIZE_EVAL_INTERVAL=300
Note: This example will restrict the short-term history files to less than 4MB, and the threshold will be checked every 5 minutes (300 seconds).
Each of the agent environment variable files, i.e. files KNTENV and KUMENV located in the directory %CANDLEHOME%\TMAITM6, must be updated with the following environment variable setting:
KBB_ENVPATH=C:\IBM\ITM\TMAITM6\GENENV
Note: This assumes the default installation directory for ITM on Windows. Both agents must be restarted for the restriction to apply.
Similarly, on a Linux platform, the environment variables KHD_TOTAL_HIST_MAXSIZE and KHD_HISTSIZE_EVAL_INTERVAL may be defined in a generic configuration file $CANDLEHOME/config/generic.cfg. This file can then be referenced by each agent, for example in the Linux OS Agent configuration file $CANDLEHOME/config/lz.ini, as follows:
KBB_ENVPATH=$CANDLEHOME$/config/generic.cfg
The Linux agent must be stopped, reconfigured and restarted for the updates to take effect. Any subsequent changes to the file generic.cfg would only require an agent restart.
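Before restarting the agents, the generic configuration file can be sanity-checked. The Python sketch below is a hypothetical helper, not an ITM tool: it parses KEY=VALUE lines and checks the two variables against the minimum and default stated earlier. Treating KHD_TOTAL_HIST_MAXSIZE as a positive whole number is an assumption.

```python
MIN_EVAL_INTERVAL = 60       # minimum stated in this article, in seconds
DEFAULT_EVAL_INTERVAL = 900  # default stated in this article, in seconds


def parse_env_text(text):
    """Parse KEY=VALUE lines, ignoring blank lines and lines without '='."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_history_settings(settings):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    max_size = settings.get("KHD_TOTAL_HIST_MAXSIZE")
    if max_size is None:
        problems.append("KHD_TOTAL_HIST_MAXSIZE is not set")
    elif not max_size.isdigit() or int(max_size) <= 0:
        # Assumption: the value is a positive whole number of megabytes.
        problems.append("KHD_TOTAL_HIST_MAXSIZE must be a positive whole number")
    interval = settings.get("KHD_HISTSIZE_EVAL_INTERVAL", str(DEFAULT_EVAL_INTERVAL))
    if not interval.isdigit() or int(interval) < MIN_EVAL_INTERVAL:
        problems.append("KHD_HISTSIZE_EVAL_INTERVAL must be at least 60 seconds")
    return problems


example = "KHD_TOTAL_HIST_MAXSIZE=4\nKHD_HISTSIZE_EVAL_INTERVAL=300\n"
print(check_history_settings(parse_env_text(example)))  # prints []
```

The same check applies to GENENV on Windows and generic.cfg on Linux, since both hold the same two settings.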
Monitoring for Threshold Breaches
This method of protecting the managed system from unrestricted short-term history file growth does have the side effect that, if the threshold is exceeded, historical data will be lost. To minimise the impact of such an outage it is possible to monitor for the condition using a basic Universal Agent application.
To create the UA Application, it is necessary to understand the messages logged by the new feature. Firstly, if the threshold is breached the log entries listed below are written to the agent log file:
(499C27A7.0000-FFC:khdxhwst.cpp,237,"setHistWriteStatus") ATTENTION: Stopped writing short-term historical data to files in directory C:\IBM\ITM\TMAITM6\logs.
(499C27A7.0001-FFC:khdxhwst.cpp,239,"setHistWriteStatus") Total size of historical files 1552KB exceeded the maximum of 1024KB.
If the combined size of the short-term history files subsequently reduces below the threshold, the following information is logged:
(499C276B.0000-FFC:khdxhwst.cpp,246,"setHistWriteStatus") ATTENTION: Restarted writing short-term historical data to files in directory C:\IBM\ITM\TMAITM6\logs.
(499C276B.0001-FFC:khdxhwst.cpp,248,"setHistWriteStatus") Total size of historical files 590KB is now less than the maximum of 1024KB.
(499C276B.0002-FFC:khdxhwst.cpp,250,"setHistWriteStatus") STH data was not recorded for 0.30 hours.
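To make the structure of these entries concrete, the following Python sketch splits an entry into its parts and classifies it by the number that follows the source file name. It assumes each entry occupies a single line in the actual log file. The field names (`stamp`, `seq`, `thread`) are labels chosen here for illustration, not documented ITM terminology.

```python
import re

# Entry header: (<hex>.<seq>-<thread>:<source file>,<number>,"<function>")
ENTRY_HEADER = re.compile(
    r'\((?P<stamp>[0-9A-Fa-f]+)\.(?P<seq>[0-9A-Fa-f]+)-(?P<thread>[0-9A-Fa-f]+):'
    r'(?P<source>[^,]+),(?P<line>\d+),"(?P<function>[^"]+)"\)\s*(?P<message>.*)'
)
# Sizes reported in the message text, e.g. "1552KB".
SIZES = re.compile(r'(\d+)KB')


def parse_entry(entry):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = ENTRY_HEADER.match(entry.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    return fields


def classify(fields):
    """Map a parsed entry to 'stopped', 'restarted' or None."""
    if fields is None or fields["function"] != "setHistWriteStatus":
        return None
    # 237 and 246 are the numbers seen in the stop and restart entries above.
    return {237: "stopped", 246: "restarted"}.get(fields["line"])
```

For the first "Stopped writing" entry, `classify(parse_entry(entry))` returns `"stopped"`. The follow-up entries numbered 239, 248 and 250 return `None`, because they only carry supporting detail such as the file sizes.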
The Universal Agent File Data Provider can be used to monitor the agent log file, format the data and filter for the short-term history entries. The metafile detailed below is an example of such a Universal Agent Application for monitoring the Windows OS agent log file on a system with the host name ITM62SVR. Minor changes would be required to the SOURCE control statement for this metafile to apply to other managed systems or platforms.
//APPL HISTORICALDATACOLLECTION
//NAME SHORTTERMHISTORY E AddTimeStamp=YearMonth
//SOURCE FILE C:/IBM/ITM/TMAITM6/Logs/ITM62SVR_nt_kntcma_{TIVOLILOGTIME}.log
//ATTRIBUTES ' '
-RUBBISH1 D 32 DLM=','
CODE C 9999 DLM=',' @Logged Message Code
MODULE D 32 DLM='""' +FILTER={MATCH(0,setHistWriteStatus)} @Logged Module Name
-RUBBISH3 D 32
MESSAGE Z 150 @Logged message
Once the metafile is loaded into the platform-specific METAFILES directory, it can be imported using the command kumpcon on Windows or um_console on UNIX/Linux. For example, on Windows, run the commands:
cd C:\IBM\ITM\TMAITM6
kumpcon import hdc.mdl
Once the application is online, the agent will collect data from the latest agent log file. This is driven by the dynamic file naming variable {TIVOLILOGTIME}. The data collected is shown in the figure below, which includes the restart and subsequent stopping of the historical data collection.
Based on this data, two situations can be defined to indicate the stopping and subsequent restart of the historical data collection. The situation definitions below, for situations named Orb_STH_Stopped and Orb_STH_Restarted, check the value of the attribute CODE. A situation event is generated when the value of the attribute equals 237 or 246, respectively. Each situation can be defined to close when the other situation in the pair opens. This is required because they are “Pure” situations, i.e. based on ad-hoc log file information, and hence do not close automatically.
Name : Orb_STH_Stopped
Full Name : Orb_STH_Stopped
Description : Short-term History Collection Stopped
Type : Universal Data Provider
Formula : *IF *VALUE HISTORICALDATACOLLECSHORTTERMHISTORY02.CODE *EQ 237 *UNTIL ( *SIT Orb_STH_Restarted )
Sampling Interval : 0/0:0:0
Run At Start Up : Yes
Distribution : ITM62SVR:HISTORICALDATACOLLEC02
Name : Orb_STH_Restarted
Full Name : Orb_STH_Restarted
Description : Short-term History Collection Restarted
Type : Universal Data Provider
Formula : *IF *VALUE HISTORICALDATACOLLECSHORTTERMHISTORY02.CODE *EQ 246 *UNTIL ( *SIT Orb_STH_Stopped )
Sampling Interval : 0/0:0:0
Run At Start Up : Yes
Distribution : ITM62SVR:HISTORICALDATACOLLEC02
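The way the two situations close each other can be pictured with a small Python model. This is a simplified illustration of the pairing only, not a description of how ITM evaluates *UNTIL clauses internally. The function name `replay` is hypothetical.

```python
STOP_CODE = 237     # CODE value matched by Orb_STH_Stopped
RESTART_CODE = 246  # CODE value matched by Orb_STH_Restarted


def replay(codes):
    """Return the set of open situations after processing CODE values in order."""
    open_situations = set()
    for code in codes:
        if code == STOP_CODE:
            # Opening Orb_STH_Stopped closes its partner.
            open_situations.add("Orb_STH_Stopped")
            open_situations.discard("Orb_STH_Restarted")
        elif code == RESTART_CODE:
            # Opening Orb_STH_Restarted closes its partner.
            open_situations.add("Orb_STH_Restarted")
            open_situations.discard("Orb_STH_Stopped")
        # Other CODE values (239, 248, 250) match neither formula.
    return open_situations
```

In this model at most one situation of the pair is open at any time, which is the effect the mutual *UNTIL clauses are intended to achieve.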
Only the situation Orb_STH_Stopped should be associated with the Universal Agent Application attribute group SHORTTERMHISTORY, or forwarded to an EIF receiver, for example the OMNIbus ObjectServer, as that is the only situation representing an abnormal condition. The situation event is seen in the figure below.
Conclusions
The ability to restrict the growth of the short-term history files is invaluable for all ITM infrastructures. It is inevitable that agent communications to the Warehouse Proxy Agent will be unavailable at certain times, whether due to maintenance of the WPA server or a network outage. It is highly undesirable that such a condition should cause the failure of a business service, and the loss of a limited amount of historical data is a small price to pay to avoid such a scenario.