Network Monitoring Alert Flow

SIMPLer follows a specific process flow for the alert messages it sends when equipment fails or is restored. The logic built into this process helps avoid undue outage alerts and prevents a restore alert from being sent for an intermittent connection until the device has stabilized. The exact logic is as follows:

1. SIMPLer uploads a list of IPs to be monitored to the WIB.
2. The WIB pings each of these IPs on a regular basis (approximately every 30 seconds) and reports any failures back to SIMPLer, where they are logged in the database.
3. Every 5 minutes SIMPLer scans the database to detect hosts that have "failed" and hosts that have "restored". The logic used for each is as follows:

Failed: A host is deemed to have failed and an alert is generated if:

1. More than 5 ping failures have been logged for the host AND
2. The time between the first failure and the most recent failure is greater than 5 minutes AND
3. The most recent failure was less than 45 minutes ago.

In short, a host must be failing frequently for more than 5 minutes before an alert is sent. If there are only intermittent ping failures (a few per hour), no alert is generated.
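The three failure conditions above can be sketched in Python as follows. This is only an illustration, not SIMPLer's actual code: the function name, the constants, and the idea of passing the logged failures as a list of timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Thresholds taken from the rules above; the names are hypothetical.
MIN_FAILURES = 5                      # need MORE than this many logged failures
MIN_SPAN = timedelta(minutes=5)       # first-to-latest failure must exceed this
MAX_AGE = timedelta(minutes=45)       # latest failure must be more recent than this


def is_failed(failure_times, now):
    """Return True when all three failure conditions hold.

    failure_times: timestamps of the ping failures currently logged for
    one host (assumed representation of the database rows).
    now: the time of the current scan.
    """
    if len(failure_times) <= MIN_FAILURES:
        return False  # condition 1 not met (also covers an empty log)
    first, last = min(failure_times), max(failure_times)
    spans_long_enough = (last - first) > MIN_SPAN   # condition 2
    still_recent = (now - last) < MAX_AGE           # condition 3
    return spans_long_enough and still_recent
```

For example, a host that misses every 30-second ping for six minutes accumulates 13 failures spanning more than 5 minutes, so a scan shortly afterwards would flag it. Three failures spread over an hour would not pass the first condition.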

Restored: A host is deemed to have restored and a restore message is sent if:

1. The most recent failure was more than 45 minutes ago.

Note that if the WIB continues to report intermittent failures from the host, each one "resets" the counter, so SIMPLer sends the "all clear" message only once it has gone more than 45 minutes without a failure. This is the usual reason for delays in the "restored" message. The idea is that SIMPLer wants to be sure the host is stable before it is cleared from the database; otherwise it could end up sending another alert message shortly after the restore message.
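The restore rule and the "resetting" effect can be illustrated with a short Python sketch. As before, this is a hypothetical illustration under assumed names (is_restored, earliest_restore_time), not the actual SIMPLer implementation.

```python
from datetime import datetime, timedelta

# Quiet period from the rule above; the name is hypothetical.
RESTORE_QUIET_PERIOD = timedelta(minutes=45)


def is_restored(last_failure, now):
    """Return True once the most recent failure is more than 45 minutes old."""
    return (now - last_failure) > RESTORE_QUIET_PERIOD


def earliest_restore_time(failure_times):
    """Return the moment after which a scan can report the host as restored.

    Only the latest failure matters, so every new failure pushes this
    time back by restarting the 45-minute wait.
    """
    return max(failure_times) + RESTORE_QUIET_PERIOD
```

With a single failure at 12:00, the host can be reported as restored after 12:45. If one more intermittent failure arrives at 12:30, the wait restarts and the restore cannot be reported until after 13:15, at whichever 5-minute scan comes next.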
