Seibold, Michael
2017-07-10 15:45:05 UTC
Hi Alejandro,
Can you confirm that this response time behaviour is the same for all polling services, or do they use different models for the timeouts?
The current behaviour may actually be quite useful in some cases:
I have often wondered how I could count the retries that occur while polling services. Retries are obviously helpful for avoiding too many false alarms. We used to have some scripts that applied statistical functions, such as summing up all retries for a given service / location / name_it, summing up response times for all locations, calculating standard deviations etc., to spot problems in provider networks even when there were no outage alarms. But this required additional polling of those services outside the monitoring system.
With the current behaviour and a little bit of thinking, it should be possible to derive the retry count from the response times: something like sum(retries=integer(response time / timeout)) for all nodes in remote locations.
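The retry estimate suggested above could be sketched roughly as follows. This is only an illustration, not part of the monitoring system: the function names and sample values are hypothetical, and it assumes that every failed attempt consumes the full timeout and that the successful attempt finishes in less than one timeout.

```python
def estimate_retries(response_time_ms, timeout_ms):
    """Estimate how many attempts timed out before the successful one.

    Assumption: each failed attempt took exactly timeout_ms, so the
    reported (cumulative) response time is
    retries * timeout_ms + time of the last attempt.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return int(response_time_ms // timeout_ms)


def total_retries(response_times_ms, timeout_ms):
    """Sum the estimated retries over many samples (e.g. all nodes in a location)."""
    return sum(estimate_retries(rt, timeout_ms) for rt in response_times_ms)


# Hypothetical samples with a 3000 ms timeout.
samples_ms = [120, 3200, 6500]
print(total_retries(samples_ms, 3000))
```

The estimate is only as good as the assumption: an attempt that fails faster than the timeout would lead to an undercount.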
-Michael
NMS-9475 Description:
During a support session, I discovered that the response time graphs associated with the SnmpMonitor show values greater than the configured timeout.
Digging into the code, I found that the reason is that the TimeTracker responsible for returning the actual value of the response time is created outside the retry loop and is not re-initialized on each attempt (or retry). So, if the monitor implementation has to retry to get a response, and it gets the response during one of the retry attempts, the response time will be the total amount of time spent across all the attempts (which can be greater than the timeout).
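To make the effect concrete, here is a small simulation of a retry loop. It is a simplified sketch and not the actual monitor code: the function name, the simulated clock, and the latency values are all hypothetical, and a timed-out attempt is assumed to consume exactly the configured timeout.

```python
def simulate_poll(attempt_latencies_ms, timeout_ms, reset_timer_each_attempt=False):
    """Return the reported response time in ms, or None if every attempt times out.

    attempt_latencies_ms: hypothetical latency of each attempt; a value
    greater than or equal to timeout_ms means the attempt timed out.
    """
    clock = 0.0        # simulated wall clock in ms
    started = clock    # timer created once, outside the retry loop
    for latency in attempt_latencies_ms:
        if reset_timer_each_attempt:
            started = clock          # alternative: restart the timer per attempt
        if latency >= timeout_ms:
            clock += timeout_ms      # the attempt gave up after the full timeout
            continue
        clock += latency
        return clock - started       # elapsed time since the timer was started
    return None


# Two timeouts followed by a 200 ms answer, with a 3000 ms timeout.
print(simulate_poll([5000, 5000, 200], 3000))
print(simulate_poll([5000, 5000, 200], 3000, reset_timer_each_attempt=True))
```

With the timer started once, the reported value covers both timed-out attempts plus the final one, so it exceeds the timeout. Restarting the timer per attempt reports only the last attempt.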
A future enhancement could be to add an optional parameter that lets the user choose the behavior: either the total transaction time or the time spent on the last attempt.
For now, updating the documentation to reflect the current behavior is enough.