2012-3Q (v002): SIMPLer: Adjustable NHM parameters
A new feature allows the NHM parameters to be adjusted, to control the conditions under which SMS messages and emails are sent.
IMPORTANT: This feature can only be enabled, disabled or adjusted by an Azotel employee, so operators should contact firstname.lastname@example.org to discuss enabling it on their instance.
Three parameters can be adjusted, as shown in the images below:
Default values used are:
NHM - Fail Interval = 300 seconds
NHM - Fail Recover Time = 2700 seconds
NHM - Number of Failures = 5
Below is an overview of how the NHM process works in SIMPLer. The key things we try to achieve are to (1) notify an operator as soon as possible once we are sure there is an outage, (2) notify the operator that the equipment has been restored once we are sure it is no longer failing, and (3) avoid spurious fail/restore messages. In particular, we do not send a "restore" message until we are sure the equipment has been stable for a reasonable amount of time, because we want to avoid ping-pong alert/restore messages.
The process works as follows:
1) SIMPLer downloads a list of IPs to be monitored to the WIB.
2) The WIB pings each of these IPs on a regular basis (approximately every 30 seconds) and reports any failures back to SIMPLer, where they are logged in the database.
3) Every 5 minutes SIMPLer scans the database to detect hosts that have "failed" and hosts that have "restored". The logic used for each is as follows:
Failed: A host is deemed to have failed and an alert is generated if:
1) more than <NHM - Number of Failures> ping failures have been logged for the host AND
2) the time between the first failure and the most recent failure is greater than <NHM - Fail Interval> AND
3) the most recent failure was less than <NHM - Fail Recover Time> ago.
The logic behind this is that a host must be failing repeatedly for longer than <NHM - Fail Interval> before we alert. An isolated, intermittent ping failure does not generate an alert.
Restored: A host is deemed to have restored and a restore message is sent if:
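The three "Failed" conditions can be sketched in Python as follows. This is an illustration only, not SIMPLer's actual code: the names NhmParams and is_failed are hypothetical, failure times are assumed to be timestamps in seconds, and the defaults are the values listed above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NhmParams:
    """Hypothetical container for the three adjustable NHM parameters."""
    fail_interval: int = 300        # NHM - Fail Interval (seconds)
    fail_recover_time: int = 2700   # NHM - Fail Recover Time (seconds)
    number_of_failures: int = 5     # NHM - Number of Failures


def is_failed(failure_times: Sequence[float], now: float,
              params: NhmParams = NhmParams()) -> bool:
    """Return True if the logged ping failures satisfy all three conditions."""
    if not failure_times:
        return False
    first, latest = min(failure_times), max(failure_times)
    return (
        # 1) more than <Number of Failures> failures have been logged
        len(failure_times) > params.number_of_failures
        # 2) first-to-latest failure spans more than <Fail Interval>
        and (latest - first) > params.fail_interval
        # 3) the latest failure is more recent than <Fail Recover Time>
        and (now - latest) < params.fail_recover_time
    )


# Assumes pings fail at exactly 30-second intervals, starting at t=0.
eleven = [30 * i for i in range(11)]   # failures at 0..300 s, span is 300 s
twelve = [30 * i for i in range(12)]   # failures at 0..330 s, span is 330 s

print(is_failed([0], now=10))          # one isolated failure
print(is_failed(eleven, now=330))      # span is not yet greater than 300 s
print(is_failed(twelve, now=360))      # all three conditions hold
```

Under these assumptions the three calls print False, False and True. With the default values, a host that fails every ping first qualifies once the span between its first and latest failure exceeds 300 seconds, and the alert is then raised on the next 5-minute scan.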
1) the most recent failure was more than <NHM - Fail Recover Time> ago.
Note that intermittent failures from the host keep "resetting" the counter, so the "all clear" message is sent only once no failure has been received for over <NHM - Fail Recover Time>. This is the usual reason for delays in the "restored" message. The idea here is that we want to be sure that the host is stable before we clear it from the database. Otherwise we could end up sending another alert message shortly after the restore message.
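The "Restored" condition, and the way a late intermittent failure delays it, can be sketched in Python as follows. The function name is_restored is hypothetical and failure times are assumed to be timestamps in seconds. The text above does not say whether a restore message requires an earlier alert, so this sketch checks the timing condition only.

```python
from typing import Sequence


def is_restored(failure_times: Sequence[float], now: float,
                fail_recover_time: int = 2700) -> bool:
    """Return True once the latest logged failure is older than
    <NHM - Fail Recover Time> (default 2700 seconds)."""
    if not failure_times:
        # Assumption: with no logged failures there is nothing to restore.
        return False
    return (now - max(failure_times)) > fail_recover_time


# An outage with failures every 30 s from t=0 to t=330 (assumed timings).
outage = [30 * i for i in range(12)]

# 2770 s after the last failure, the host counts as restored.
print(is_restored(outage, now=3100))

# One stray failure at t=2000 restarts the wait: only 1100 s have passed.
print(is_restored(outage + [2000], now=3100))

# The restore now has to wait until more than 2700 s after t=2000.
print(is_restored(outage + [2000], now=4701))
```

Under these assumptions the calls print True, False and True. A single late failure pushes the "restored" message back by up to a full <NHM - Fail Recover Time>.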
Azotel | River House | Blackpool Park | Cork | Ireland
US +1-312-239-0680 | IE +353-21-234-8100 | UK +44-207-193-4170 | SA +27-11-083-6900