Low counter thresholds trip falsely after a data collection outage

Description

If you schedule a data collection outage on a node and have a counter-based low threshold set on a datasource, the threshold always fires when the outage ends. The workaround is to set the trigger to a value other than 1, but that hampers one's ability to detect problems quickly.

Repro:
- Schedule a data collection outage on a node
- Set a low threshold (with a trigger of 1) on a counter collected from that node
- Wait for the outage to end
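To illustrate why a trigger of 1 is sensitive here, below is a minimal sketch of a low threshold with a trigger count and a rearm level. It is a hypothetical model written for this ticket, not OpenNMS code: the class name `LowThreshold`, its fields, and the event strings are assumptions. The `0` in the sample list stands in for a single spurious low reading right after the outage; the ticket does not confirm that this is what the collector actually produced.

```python
from dataclasses import dataclass


@dataclass
class LowThreshold:
    """Hypothetical model of a low threshold; not the OpenNMS implementation."""
    value: float      # fire when a sample is at or below this level
    rearm: float      # rearm when a sample is at or above this level
    trigger: int      # consecutive low samples required before firing
    _hits: int = 0
    _armed: bool = True

    def evaluate(self, sample: float) -> str:
        if self._armed:
            if sample <= self.value:
                self._hits += 1
                if self._hits >= self.trigger:
                    self._armed = False
                    self._hits = 0
                    return "triggered"
            else:
                # Any sample above the threshold resets the consecutive count.
                self._hits = 0
        elif sample >= self.rearm:
            self._armed = True
            return "rearmed"
        return "no_change"


# Assumed sample stream: healthy values with one bogus low reading after the outage.
samples = [50, 50, 0, 50, 50]

with_trigger_1 = LowThreshold(value=10.0, rearm=25.0, trigger=1)
print([with_trigger_1.evaluate(s) for s in samples])

with_trigger_2 = LowThreshold(value=10.0, rearm=25.0, trigger=2)
print([with_trigger_2.evaluate(s) for s in samples])
```

In this model a trigger of 1 fires on the single low sample, while a trigger of 2 absorbs it, which matches the workaround described above at the cost of one extra polling interval before a real problem is reported.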

Environment

Operating System: All
Platform: PC

Acceptance / Success Criteria

None

Activity

Jeff Gehlbach March 31, 2010 at 10:44 AM

Thanks for the update, Richard.

Resolving bug, guess it goes into the list for 1.6.11 since we're not sure exactly when it was fixed.

Richard Hesse March 29, 2010 at 1:01 PM

Jeff, looks like this was resolved sometime between 1.6.7 and 1.6.10. We're no longer seeing the issue.

Jeff Gehlbach March 29, 2010 at 12:43 PM

Hey Richard, have you had a chance to upgrade to 1.6.10 and check whether this issue still exists? I think there have been some thresholding fixes since 1.6.7.

Richard Hesse November 6, 2009 at 3:07 PM

You tested with gauges. Gauges work fine. It is counters that are busted. Try re-running your test with a counter.
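For readers unfamiliar with the distinction, here is a rough sketch of why a collection gap can matter for counters but not for gauges. A gauge is used as read, while a counter only becomes meaningful as a rate computed from two consecutive samples. The function names, the `max_gap` parameter, and the choice to return `None` for an unusable interval are all assumptions made for illustration; the ticket does not identify the root cause, and this is not how OpenNMS is known to compute rates.

```python
from typing import Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, raw value)


def gauge_value(current: Sample) -> float:
    """A gauge needs no history: the raw reading is the value."""
    return current[1]


def counter_rate(prev: Optional[Sample], current: Sample,
                 max_gap: float = 600.0) -> Optional[float]:
    """Per-second rate from two counter samples, or None if it cannot be trusted.

    max_gap is a hypothetical limit on how old the previous sample may be.
    """
    if prev is None:
        return None                      # no baseline yet
    elapsed = current[0] - prev[0]
    if elapsed <= 0 or elapsed > max_gap:
        return None                      # gap too long, e.g. across an outage
    delta = current[1] - prev[1]
    if delta < 0:
        return None                      # counter reset; wrap handling omitted
    return delta / elapsed


# Normal 5-minute interval: 300 units over 300 seconds.
print(counter_rate((0.0, 100.0), (300.0, 400.0)))

# First sample after a one-hour outage: the sketch reports "unknown", not 0.
print(counter_rate((0.0, 100.0), (3900.0, 400.0)))
```

The design point is that an unusable interval should yield "unknown" and be skipped by threshold evaluation. If such an interval were instead reported as 0 or a very small rate, a low threshold with a trigger of 1 would fire, which would be consistent with the reported symptom.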

Alejandro Galue November 6, 2009 at 1:56 PM

Hello,

Using the latest 1.6-testing branch (rev. 15189), I tried to reproduce the problem without success. It works for me. This is what I did:

1. Compile 1.6-testing branch

2. Add mib2-host-resources-storage to Net-SNMP's systemDef in datacollection-config.xml to use an hrStorageIndex-based threshold.

3. Run OpenNMS

4. Add a test node (one of my Linux boxes running Fedora 11 x86).

5. Wait until the test device is properly discovered and the SNMP graphs are OK.

6. Create a low threshold using WebUI:

<group name="netsnmp" rrdRepository="/Users/agalue/Development/opennms/1.6-testing/opennms/target/opennms-1.6.8-TESTING-SNAPSHOT/share/rrd/snmp/">
    <expression type="low" ds-type="hrStorageIndex" value="10.0" rearm="25.0" trigger="1" ds-label="hrStorageDescr" expression="hrStorageUsed / hrStorageSize * 100.0">
        <resource-filter field="hrStorageType">^\.1\.3\.6\.1\.2\.1\.25\.2\.1\.4$</resource-filter>
    </expression>
</group>

Note: Current file system levels are OK, so the threshold should not be triggered.

7. Create a scheduled outage using WebUI for collectd only (because in-line thresholding is unrelated to threshd):

<?xml version="1.0" encoding="UTF-8"?>
<outages xmlns="http://xmlns.opennms.org/xsd/config/poller/outages">
    <outage name="My Test" type="specific">
        <ns1:time xmlns:ns1="http://xmlns.opennms.org/xsd/types"
                  begins="06-Nov-2009 12:00:00" ends="06-Nov-2009 13:00:00"/>
        <interface address="192.168.190.30"/>
        <node id="3"/>
    </outage>
</outages>

and ...

<package name="example1">
    <filter>IPADDR != '0.0.0.0'</filter>
    <include-range begin="1.1.1.1" end="254.254.254.254"/>
    <service name="SNMP" interval="300000" user-defined="false" status="on">
        <parameter key="collection" value="default"/>
        <parameter key="thresholding-enabled" value="true"/>
    </service>
    <outage-calendar xmlns="">My Test</outage-calendar>
</package>

8. Outage begins.

9. Outage ends.

I verified that the graph was empty during the scheduled outage as expected, but no low threshold triggered.

Am I doing something wrong? Did I forget something?

What version of OpenNMS are you using? Can you post your configuration?
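As a reading aid for the threshold definition in step 6, the sketch below evaluates the same expression and resource filter by hand. It is an illustration only: the dictionary layout of a resource and the function names are assumptions, and whether OpenNMS applies the filter as a full match or a search is not stated in this ticket (the pattern is anchored, so either gives the same result here). The OID in the filter, .1.3.6.1.2.1.25.2.1.4, is hrStorageFixedDisk in HOST-RESOURCES-MIB.

```python
import re
from typing import Optional

# Same pattern as the <resource-filter> in step 6: fixed disks only.
FIXED_DISK = re.compile(r"^\.1\.3\.6\.1\.2\.1\.25\.2\.1\.4$")


def percent_used(used: float, size: float) -> float:
    """Mirror of the expression: hrStorageUsed / hrStorageSize * 100.0"""
    return used / size * 100.0


def is_low(resource: dict, value: float = 10.0) -> Optional[bool]:
    """True if the resource is at or below the low threshold value.

    Returns None when the resource does not pass the filter and is skipped.
    """
    if not FIXED_DISK.search(resource["hrStorageType"]):
        return None
    return percent_used(resource["hrStorageUsed"], resource["hrStorageSize"]) <= value


# Hypothetical resources; the numbers are made up for the example.
root_fs = {"hrStorageType": ".1.3.6.1.2.1.25.2.1.4",
           "hrStorageUsed": 40, "hrStorageSize": 100}
ram = {"hrStorageType": ".1.3.6.1.2.1.25.2.1.2",
       "hrStorageUsed": 5, "hrStorageSize": 100}

print(is_low(root_fs))  # fixed disk at 40% used
print(is_low(ram))      # filtered out: not a fixed disk
```

Note that this test threshold is built from two gauge-like values, which is why it does not exercise the counter case that the report describes.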

Fixed

Details

Created October 30, 2009 at 2:51 PM
Updated January 27, 2017 at 4:25 PM
Resolved May 20, 2010 at 1:27 AM