Discussion:
[opennms-discuss] NMS-9475: Document the logic behind the response time value reported by the SnmpMonitor
Seibold, Michael
2017-07-10 15:45:05 UTC
Hi Alejandro,

can you confirm that this response time behaviour is the same for all polling services or do they use different models for the timeouts?

The current behaviour might actually be quite useful in some cases:

I often wondered how I could count the retries that occur while polling services. To avoid too many false alarms, retries are obviously helpful. We used to have some scripts using statistical functions (summing up all retries for a given service, location, you name it; summing up response times for all locations; calculating standard deviations; etc.) to see problems in provider networks even when there were no alarms for outages. But this required additional polling of those services outside of the monitoring system.

With the current behaviour and a little bit of thinking, it should be possible to derive the retry count from the response times: something like sum(retries = integer(response time / timeout)) over all nodes in remote locations.
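The estimate above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every failed attempt is assumed to consume the full timeout, the successful attempt is assumed to finish in less than one timeout, and the function name and sample values are hypothetical rather than anything taken from OpenNMS.

```python
def estimate_retries(response_time_ms, timeout_ms):
    """Estimate how many attempts timed out before the successful one.

    Assumption: each failed attempt used up the whole timeout, so the
    integer part of response_time / timeout is the number of retries.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return int(response_time_ms // timeout_ms)


# Hypothetical response times (ms) collected for nodes in one location.
samples_ms = [120, 3150, 6420, 80]
timeout_ms = 3000

# Sum of the per-sample estimates: 0 + 1 + 2 + 0
total_retries = sum(estimate_retries(rt, timeout_ms) for rt in samples_ms)
```

The estimate breaks down if a monitor reports only the time of the last attempt, because the quotient is then always zero.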

-Michael



NMS-9475 Description:

During a support session, I discovered that the response time graphs associated with the SnmpMonitor show values greater than the configured timeout.

Digging into the code, I found that the reason is that the TimeTracker responsible for returning the actual value of the response time is created outside the retry loop, and it is not re-initialized on each attempt (or retry). So, if the monitor implementation has to retry to get a response, and it actually gets the response within one of the retry attempts, the response time will be the total amount of time spent during all the attempts (which can be greater than the timeout).
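To make the effect concrete, here is a small Python simulation of a retry loop whose timer is started once, before the loop. It is a sketch and not the OpenNMS code (which is Java): the function name is hypothetical, latencies are simulated instead of measured, and a failed attempt is assumed to wait for the full timeout.

```python
def poll_cumulative(attempt_latencies_ms, timeout_ms):
    """Simulate a poll whose timer is created outside the retry loop.

    attempt_latencies_ms: one entry per attempt, either a latency in ms
    or None for an attempt that never gets an answer.
    Returns the reported response time, or None if every attempt failed.
    """
    elapsed_ms = 0  # stands in for a timer that is never re-initialized
    for latency in attempt_latencies_ms:
        if latency is None or latency >= timeout_ms:
            elapsed_ms += timeout_ms  # the attempt waited the full timeout
            continue
        elapsed_ms += latency
        return elapsed_ms  # total time across all attempts so far
    return None


# Two timeouts of 3000 ms, then an answer after 150 ms.
reported_ms = poll_cumulative([None, None, 150], timeout_ms=3000)
```

With these simulated values the reported time is 6150 ms, well above the 3000 ms timeout, which matches what the graphs show.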

A future enhancement could be to add an optional parameter to let the user choose the behavior. In this case, we can choose between having the total transaction time, or having the time spent on the last attempt.

For now, updating the documentation to reflect the current behavior is enough.
Alejandro Galue
2017-07-10 15:49:28 UTC
Hi Michael,
Post by Seibold, Michael
can you confirm that this response time behaviour is the same for all polling services or do they use different models for the timeouts?
Short answer: I don’t know.

There are about 80 different monitor implementations, so it could take a while, as I would have to study the source code of each of them in order to tell you whether the behavior is the same or not.

Alejandro Galue
***@opennms.org <mailto:***@opennms.org>
PGP Key Fingerprint: 5293 6234 1E75 DF30 7821 1823 87AF 972E DAF8 BE2C
Seibold, Michael
2017-07-10 16:11:39 UTC
Hi Alejandro,

don't bother with all of them, that is way too much work. But maybe you can check ICMP, because SNMP and ICMP together already provide enough measurement data to get good statistics, and those two are measured on most of the monitored equipment.

Thanks a lot
Michael
Alejandro Galue
2017-07-10 17:04:36 UTC
Michael,
Post by Seibold, Michael
don't bother with all of them, that is way too much work. But maybe you can check ICMP, because SNMP and ICMP together already provide enough measurement data to get good statistics, and those two are measured on most of the monitored equipment.
ICMP is tricky due to how it is implemented (and also considering that there are several implementations).

It seems that the response time is the time taken to receive a successful response, measured from the moment that particular request was sent.

So, regardless of how many retries you have, you'll get the time for the last, successful attempt (which is not the case for the SnmpMonitor).
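As a rough Python illustration of that per-attempt model (not the actual ICMP implementation, of which there are several), the sketch below takes a timestamp when each request is sent and reports only the time since that timestamp. The function name and the simulated clock are hypothetical.

```python
def poll_last_attempt(attempt_latencies_ms, timeout_ms):
    """Simulate a poll that times each request from the moment it is sent.

    attempt_latencies_ms: one entry per attempt, either a latency in ms
    or None for an attempt that never gets an answer.
    Returns the time of the successful attempt only, or None on failure.
    """
    clock_ms = 0  # simulated wall clock
    for latency in attempt_latencies_ms:
        sent_at_ms = clock_ms  # timestamp taken when this request is sent
        if latency is None or latency >= timeout_ms:
            clock_ms += timeout_ms  # the attempt waited the full timeout
            continue
        clock_ms += latency
        return clock_ms - sent_at_ms  # earlier attempts are not counted
    return None


# Two timeouts of 3000 ms, then an answer after 150 ms.
reported_ms = poll_last_attempt([None, None, 150], timeout_ms=3000)
```

Here the reported value is 150 ms even though the whole poll took 6150 ms, so the number of retries cannot be recovered from the response time under this model.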

Alejandro Galue
Seibold, Michael
2017-07-11 05:48:36 UTC
Hi Alejandro,


> So, regardless of how many retries you have, you'll get the time for the last, successful attempt (which is not the case for the SnmpMonitor).

thanks a lot!

-Michael
