If the HostResourceSwRunMonitor fails because of a timeout, the reason does not reflect it.
Description
Activity
Alejandro Galue July 2, 2014 at 2:24 PM
Fixed on revision f52df67eb99ff4aa28e84ee318538646b561c2b0 for 1.12
Alejandro Galue July 2, 2014 at 12:46 PM
I'm almost sure that all the SNMP-based monitors suffer from a similar problem.
Alejandro Galue July 2, 2014 at 12:44 PM
With a small code fix, it is possible to return a different message if the SNMP walker generates an error, for example:
$ ./poller-test -i 172.20.1.135 -s Skype -P example1
Checking service Skype on IP 172.20.1.135
Package: example1
Monitor: org.opennms.netmgt.poller.monitors.HostResourceSwRunMonitor
Parameter service-name : Skype
Available ? false (status Down[2])
Reason: Timeout retrieving HostResourceSwRunMonitor for /172.20.1.135: HostResourceSwRunMonitor: snmpTimeoutError for: /172.20.1.135
Besides that, HostResourceSwRunMonitor is not returning the response time when the service is available.
With the following patch, we can fix both issues:
diff --git a/opennms-services/src/main/java/org/opennms/netmgt/poller/monitors/HostResourceSwRunMonitor.java b/opennms-services/src/main/java/org/opennms/n
index f83a23b..952ab98 100644
--- a/opennms-services/src/main/java/org/opennms/netmgt/poller/monitors/HostResourceSwRunMonitor.java
+++ b/opennms-services/src/main/java/org/opennms/netmgt/poller/monitors/HostResourceSwRunMonitor.java
@@ -38,6 +38,7 @@ import java.util.Map;
import org.apache.log4j.Level;
import org.opennms.core.utils.InetAddressUtils;
import org.opennms.core.utils.ParameterMap;
+import org.opennms.core.utils.TimeoutTracker;
import org.opennms.netmgt.config.SnmpPeerFactory;
import org.opennms.netmgt.model.PollStatus;
import org.opennms.netmgt.poller.Distributable;
@@ -236,10 +237,18 @@ public class HostResourceSwRunMonitor extends SnmpMonitorStrategy {
statusResults.put(result.getInstance(), result.getValue(serviceStatusOidId));
}
};
- TableTracker tracker = new TableTracker(callback, serviceNameOidId, serviceStatusOidId);
- SnmpWalker walker = SnmpUtils.createWalker(agentConfig, "HostResourceSwRunMonitor", tracker);
+ TimeoutTracker tracker = new TimeoutTracker(parameters, agentConfig.getRetries(), agentConfig.getTimeout());
+ tracker.reset();
+ tracker.startAttempt();
+
+ TableTracker tableTracker = new TableTracker(callback, serviceNameOidId, serviceStatusOidId);
+ SnmpWalker walker = SnmpUtils.createWalker(agentConfig, "HostResourceSwRunMonitor", tableTracker);
walker.start();
walker.waitFor();
+ String error = walker.getErrorMessage();
+ if (error != null && !error.trim().equals("")) {
+ return logDown(Level.WARN, error);
+ }
// Iterate over the list of running services
for(SnmpInstId nameInstance : nameResults.keySet()) {
@@ -251,7 +260,7 @@ public class HostResourceSwRunMonitor extends SnmpMonitorStrategy {
log().debug("poll: HostResourceSwRunMonitor poll succeeded, addr=" + hostAddress + ", service-name=" + serviceName + ", value=" + name
// Using the instance of the service, get its status and see if it meets the criteria
if (meetsCriteria(value, "<=", runLevel)) {
- status = PollStatus.available();
+ status = PollStatus.available(tracker.elapsedTimeInMillis());
// If we get here, that means the service passed the criteria, if only one match is desired we exit.
if ("false".equals(matchAll)) {
return status;
Now, if the community is correct:
$ ./poller-test -i 172.20.1.135 -s Skype -P example1
Checking service Skype on IP 172.20.1.135
Package: example1
Monitor: org.opennms.netmgt.poller.monitors.HostResourceSwRunMonitor
Parameter service-name : Skype
Available ? true (status Up[1])
Response time: 304.041
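The two changes in the patch (report a walker error before looking at the rows, and attach the elapsed time when the service is up) can be sketched in isolation. This is only an illustration under stated assumptions: SwRunPollSketch, WalkResult, Outcome and evaluate are hypothetical stand-ins and not OpenNMS classes, and the "not found" wording is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these types are OpenNMS API.
final class SwRunPollSketch {

    /** What a table walk produced: the collected rows plus an optional error message. */
    static final class WalkResult {
        final Map<String, Integer> statusByName; // process name -> run status value
        final String errorMessage;               // null or blank when the walk succeeded

        WalkResult(Map<String, Integer> statusByName, String errorMessage) {
            this.statusByName = statusByName;
            this.errorMessage = errorMessage;
        }
    }

    /** Minimal poll outcome: availability, a reason when down, a response time when up. */
    static final class Outcome {
        final boolean available;
        final String reason;
        final Double responseTimeMillis;

        private Outcome(boolean available, String reason, Double responseTimeMillis) {
            this.available = available;
            this.reason = reason;
            this.responseTimeMillis = responseTimeMillis;
        }

        static Outcome up(double elapsedMillis) {
            return new Outcome(true, null, elapsedMillis);
        }

        static Outcome down(String reason) {
            return new Outcome(false, reason, null);
        }
    }

    static Outcome evaluate(WalkResult walk, String serviceName, int runLevel, double elapsedMillis) {
        // Step 1: if the walk itself failed (for example a timeout), report that error.
        // Without this check an empty result map looks exactly like "service not found".
        String error = walk.errorMessage;
        if (error != null && !error.trim().isEmpty()) {
            return Outcome.down(error);
        }
        // Step 2: the walk succeeded, so a missing row really means the service is absent.
        Integer status = walk.statusByName.get(serviceName);
        if (status != null && status <= runLevel) {
            // Step 3: include the elapsed time so the caller can record a response time.
            return Outcome.up(elapsedMillis);
        }
        return Outcome.down("service not found or not at run level, service-name=" + serviceName);
    }

    public static void main(String[] args) {
        Map<String, Integer> rows = new LinkedHashMap<>();
        rows.put("Skype", 1);
        Outcome ok = evaluate(new WalkResult(rows, null), "Skype", 2, 304.0);
        Outcome timedOut = evaluate(
                new WalkResult(new LinkedHashMap<>(), "snmpTimeoutError for: /172.20.1.135"),
                "Skype", 2, 0.0);
        System.out.println("available=" + ok.available + " responseTime=" + ok.responseTimeMillis);
        System.out.println("available=" + timedOut.available + " reason=" + timedOut.reason);
    }
}
```

The ordering is the point of the sketch: checking the error first keeps a timeout from being reported as a missing service, which is the misleading reason described in this issue.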
Fixed
Details
Assignee
Alejandro Galue
Reporter
Alejandro Galue
Labels
Components
Fix versions
Affects versions
Priority
Minor
Created July 2, 2014 at 12:42 PM
Updated January 27, 2017 at 4:21 PM
Resolved July 2, 2014 at 2:24 PM
For example, the HostResourceSwRunMonitor gives misleading information when the poll times out. Many of the other monitors distinguish a failed poll or a timeout from a service that was not found.
crond outage identified on interface 1.1.1.1 with reason code: HostResourceSwRunMonitor service not found, addr=1.1.1.1, service-name=~^(cron[d]{0,1})$.
SNMP outage identified on interface 1.1.1.1 with reason code: SNMP poll failed, addr=1.1.1.1 oid=.1.3.6.1.2.1.1.2.0.
Load Average outage identified on interface 1.1.1.1 with reason code: SNMP poll failed, addr=1.1.1.1 oid=.1.3.6.1.4.1.2021.10.1.5.3.
To reproduce the behavior, replace the community string with an incorrect one and wait for the outage:
$ ./poller-test -i 172.20.1.135 -s Skype -P example1
Checking service Skype on IP 172.20.1.135
Package: example1
Monitor: org.opennms.netmgt.poller.monitors.HostResourceSwRunMonitor
Parameter service-name : Skype
Available ? false (status Down[2])
Reason: HostResourceSwRunMonitor service not found, addr=172.20.1.135, service-name=Skype