If the HostResourceSwRunMonitor fails because of a timeout, the reason does not reflect it.

Description

The HostResourceSwRunMonitor gives misleading information when the poll times out: it reports "service not found" even though the SNMP request never completed. Many of the other monitors distinguish between a poll that failed or timed out and a service that was not found. For example:

crond outage identified on interface 1.1.1.1 with reason code: HostResourceSwRunMonitor service not found, addr=1.1.1.1, service-name=~^(cron[d]{0,1})$.
SNMP outage identified on interface 1.1.1.1 with reason code: SNMP poll failed, addr=1.1.1.1 oid=.1.3.6.1.2.1.1.2.0.
Load Average outage identified on interface 1.1.1.1 with reason code: SNMP poll failed, addr=1.1.1.1 oid=.1.3.6.1.4.1.2021.10.1.5.3.

To reproduce the behavior, replace the community string with an incorrect one and wait for the outage:

$ ./poller-test -i 172.20.1.135 -s Skype -P example1
Checking service Skype on IP 172.20.1.135
Package: example1
Monitor: org.opennms.netmgt.poller.monitors.HostResourceSwRunMonitor
Parameter service-name : Skype
Available ? false (status Down[2])
Reason: HostResourceSwRunMonitor service not found, addr=172.20.1.135, service-name=Skype

Acceptance / Success Criteria

None

Activity

Alejandro Galue July 2, 2014 at 2:24 PM

Fixed in revision f52df67eb99ff4aa28e84ee318538646b561c2b0 for 1.12.

Alejandro Galue July 2, 2014 at 12:46 PM

I'm almost sure that all the SNMP-based monitors suffer from a similar problem.

Alejandro Galue July 2, 2014 at 12:44 PM

With a small code fix, it is possible to return a different message if the SNMP walker generates an error, for example:

$ ./poller-test -i 172.20.1.135 -s Skype -P example1
Checking service Skype on IP 172.20.1.135
Package: example1
Monitor: org.opennms.netmgt.poller.monitors.HostResourceSwRunMonitor
Parameter service-name : Skype
Available ? false (status Down[2])
Reason: Timeout retrieving HostResourceSwRunMonitor for /172.20.1.135: HostResourceSwRunMonitor: snmpTimeoutError for: /172.20.1.135

Besides that, HostResourceSwRunMonitor does not return the response time when the service is available.

With the following patch, we can fix both issues:

diff --git a/opennms-services/src/main/java/org/opennms/netmgt/poller/monitors/HostResourceSwRunMonitor.java b/opennms-services/src/main/java/org/opennms/n
index f83a23b..952ab98 100644
--- a/opennms-services/src/main/java/org/opennms/netmgt/poller/monitors/HostResourceSwRunMonitor.java
+++ b/opennms-services/src/main/java/org/opennms/netmgt/poller/monitors/HostResourceSwRunMonitor.java
@@ -38,6 +38,7 @@ import java.util.Map;
 import org.apache.log4j.Level;
 import org.opennms.core.utils.InetAddressUtils;
 import org.opennms.core.utils.ParameterMap;
+import org.opennms.core.utils.TimeoutTracker;
 import org.opennms.netmgt.config.SnmpPeerFactory;
 import org.opennms.netmgt.model.PollStatus;
 import org.opennms.netmgt.poller.Distributable;
@@ -236,10 +237,18 @@ public class HostResourceSwRunMonitor extends SnmpMonitorStrategy {
                     statusResults.put(result.getInstance(), result.getValue(serviceStatusOidId));
                 }
             };
-            TableTracker tracker = new TableTracker(callback, serviceNameOidId, serviceStatusOidId);
-            SnmpWalker walker = SnmpUtils.createWalker(agentConfig, "HostResourceSwRunMonitor", tracker);
+            TimeoutTracker tracker = new TimeoutTracker(parameters, agentConfig.getRetries(), agentConfig.getTimeout());
+            tracker.reset();
+            tracker.startAttempt();
+
+            TableTracker tableTracker = new TableTracker(callback, serviceNameOidId, serviceStatusOidId);
+            SnmpWalker walker = SnmpUtils.createWalker(agentConfig, "HostResourceSwRunMonitor", tableTracker);
             walker.start();
             walker.waitFor();
+            String error = walker.getErrorMessage();
+            if (error != null && !error.trim().equals("")) {
+                return logDown(Level.WARN, error);
+            }
 
             // Iterate over the list of running services
             for(SnmpInstId nameInstance : nameResults.keySet()) {
@@ -251,7 +260,7 @@ public class HostResourceSwRunMonitor extends SnmpMonitorStrategy {
                         log().debug("poll: HostResourceSwRunMonitor poll succeeded, addr=" + hostAddress + ", service-name=" + serviceName + ", value=" + name
                         // Using the instance of the service, get its status and see if it meets the criteria
                         if (meetsCriteria(value, "<=", runLevel)) {
-                            status = PollStatus.available();
+                            status = PollStatus.available(tracker.elapsedTimeInMillis());
                             // If we get here, that means the service passed the criteria, if only one match is desired we exit.
                             if ("false".equals(matchAll)) {
                                 return status;
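
For anyone looking at the other SNMP-based monitors, the order of checks in the patch can be sketched in isolation. This is only an illustration, not the OpenNMS code: WalkResult and Status are hypothetical stand-ins for SnmpWalker and PollStatus, exact-name matching replaces the monitor's real service-name handling, and the run level of 2 used in main is an assumption. The point is that the walker error is checked before the results are inspected, so a timeout is never reported as "service not found", and that the elapsed time is attached when the service is up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch: WalkResult and Status are hypothetical stand-ins,
// not OpenNMS classes (the real monitor uses SnmpWalker and PollStatus).
class SwRunPollSketch {

    static final class WalkResult {
        final String errorMessage;            // null or blank if the walk succeeded
        final Map<String, Integer> runStatus; // process name -> hrSWRunStatus value

        WalkResult(String errorMessage, Map<String, Integer> runStatus) {
            this.errorMessage = errorMessage;
            this.runStatus = runStatus;
        }
    }

    static final class Status {
        final boolean up;
        final String reason;         // set only when the service is down
        final double responseMillis; // set only when the service is up

        private Status(boolean up, String reason, double responseMillis) {
            this.up = up;
            this.reason = reason;
            this.responseMillis = responseMillis;
        }

        static Status down(String reason) { return new Status(false, reason, -1); }
        static Status up(double responseMillis) { return new Status(true, null, responseMillis); }
    }

    static Status poll(WalkResult walk, String serviceName, int runLevel, long startNanos) {
        // 1. A transport error (for example a timeout) wins over everything else.
        String error = walk.errorMessage;
        if (error != null && !error.trim().isEmpty()) {
            return Status.down(error);
        }
        // 2. The walk succeeded: look for the process and compare its run status.
        for (Map.Entry<String, Integer> entry : walk.runStatus.entrySet()) {
            if (entry.getKey().equals(serviceName) && entry.getValue() <= runLevel) {
                double elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000.0;
                return Status.up(elapsedMillis);
            }
        }
        // 3. Only now is "not found" an accurate reason.
        return Status.down("service not found, service-name=" + serviceName);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        Map<String, Integer> processes = new LinkedHashMap<String, Integer>();
        processes.put("Skype", 1); // 1 = running in HOST-RESOURCES-MIB hrSWRunStatus

        Status timedOut = poll(new WalkResult("snmpTimeoutError for: /172.20.1.135",
                new LinkedHashMap<String, Integer>()), "Skype", 2, start);
        Status missing = poll(new WalkResult(null, processes), "crond", 2, start);
        Status running = poll(new WalkResult(null, processes), "Skype", 2, start);

        System.out.println("timed out: up=" + timedOut.up + ", reason=" + timedOut.reason);
        System.out.println("missing:   up=" + missing.up + ", reason=" + missing.reason);
        System.out.println("running:   up=" + running.up + ", ms=" + running.responseMillis);
    }
}
```

If the other monitors follow the same walker pattern, the same early return on a non-blank error message would presumably apply to them as well, but that has not been verified here.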

Now, if the community string is correct, the response time is reported:

$ ./poller-test -i 172.20.1.135 -s Skype -P example1
Checking service Skype on IP 172.20.1.135
Package: example1
Monitor: org.opennms.netmgt.poller.monitors.HostResourceSwRunMonitor
Parameter service-name : Skype
Available ? true (status Up[1])
Response time: 304.041
Fixed

Details

Created July 2, 2014 at 12:42 PM
Updated January 27, 2017 at 4:21 PM
Resolved July 2, 2014 at 2:24 PM