icmp and http thresholds not working properly

Description

As I know when the settings for threshold are:

OpenNMS should send alert when ping exceeds value 75000 five polls in a
row. But here is an example when OpenNMS sends alert on second poll when
threshold exceeded:

threshd.log:

1st poll:

2006-09-04 07:01:52,717 DEBUG [ThreshdScheduler-5 Pool-fiber4]
LatencyThresholder: check: service= ICMP interface= 172.16.254.29
nodeId= 44 thresholding-group=icmp-latency interval=300000ms
2006-09-04 07:01:52,717 DEBUG [ThreshdScheduler-5 Pool-fiber4]
LatencyThresholder: check: rrd repository=/var/lib/opennms/rrd/response/
2006-09-04 07:01:52,717 DEBUG [ThreshdScheduler-5 Pool-fiber4]
LatencyThresholder: checkPerformanceDir: threshold checking dir:
/var/lib/opennms/rrd/response/172.16.254.29
2006-09-04 07:01:52,717 DEBUG [ThreshdScheduler-5 Pool-fiber4]
LatencyThresholder: checking value of last possible PDP only
2006-09-04 07:01:52,717 DEBUG [ThreshdScheduler-5 Pool-fiber4]
JniRrdStrategy: fetch: Issuing RRD command: fetch
/var/lib/opennms/rrd/response/172.16.254.29/icmp.rrd AVERAGE -s now-300
-e now-300
2006-09-04 07:01:52,772 DEBUG [ThreshdScheduler-5 Pool-fiber4]
JniRrdStrategy: fetch: fetch successful: icmp= 1698.8533333
2006-09-04 07:01:52,772 DEBUG [ThreshdScheduler-5 Pool-fiber4]
ThresholdEntity: evaluate: value= 1698.8533333 against threshold:
dsName=icmp,dsType=if: highVal=75000.0,highRearm=25000.0,highTrigger=5
2006-09-04 07:01:52,773 DEBUG [ThreshdScheduler-5 Pool-fiber4]
Scheduler: schedule: Adding ready runnable
org.opennms.netmgt.threshd.ThresholdableService@18a9fc8 at interval
300000
2006-09-04 07:01:52,773 DEBUG [ThreshdScheduler-5 Pool-fiber4]
Scheduler: schedule: queue element added, notification not performed

2nd poll:
2006-09-04 07:06:53,463 DEBUG [ThreshdScheduler-5 Pool-fiber4]
JniRrdStrategy: fetch: Issuing RRD command: fetch
/var/lib/opennms/rrd/response/172.16.254.29/icmp.rrd AVERAGE -s now-300
-e now-300
2006-09-04 07:06:53,463 DEBUG [ThreshdScheduler-5 Pool-fiber4]
JniRrdStrategy: fetch: fetch successful: icmp= 315766.18
2006-09-04 07:06:53,463 DEBUG [ThreshdScheduler-5 Pool-fiber4]
ThresholdEntity: evaluate: value= 315766.18 against threshold:
dsName=icmp,dsType=if: highVal=75000.0,highRearm=25000.0,highTrigger=5
2006-09-04 07:06:53,463 DEBUG [ThreshdScheduler-5 Pool-fiber4]
ThresholdEntity: evaluate: high threshold exceeded, count=4
2006-09-04 07:06:53,463 DEBUG [ThreshdScheduler-5 Pool-fiber4]
Scheduler: schedule: Adding ready runnable
org.opennms.netmgt.threshd.ThresholdableService@18a9fc8 at interval
300000
2006-09-04 07:06:53,464 DEBUG [ThreshdScheduler-5 Pool-fiber4]
Scheduler: schedule: queue element added, notification not performed

3rd poll:
2006-09-04 07:11:54,293 DEBUG [ThreshdScheduler-5 Pool-fiber4]
JniRrdStrategy: fetch: Issuing RRD command: fetch
/var/lib/opennms/rrd/response/172.16.254.29/icmp.rrd AVERAGE -s now-300
-e now-300
2006-09-04 07:11:54,294 DEBUG [ThreshdScheduler-5 Pool-fiber4]
JniRrdStrategy: fetch: fetch successful: icmp= 81693.406667
2006-09-04 07:11:54,294 DEBUG [ThreshdScheduler-5 Pool-fiber4]
ThresholdEntity: evaluate: value= 81693.406667 against threshold:
dsName=icmp,dsType=if: highVal=75000.0,highRearm=25000.0,highTrigger=5
2006-09-04 07:11:54,294 DEBUG [ThreshdScheduler-5 Pool-fiber4]
ThresholdEntity: evaluate: high threshold exceeded, count=5
2006-09-04 07:11:54,294 DEBUG [ThreshdScheduler-5 Pool-fiber4]
ThresholdEntity: evaluate: high threshold triggered!

rrd.xml:

<row><v>
2.0627266667e+03 </v></row>

<row><v> 1.7620166667e+03 </v></row>

<row><v> 1.7191200000e+03 </v></row>

<row><v> 2.3050833333e+03 </v></row>

<row><v> 1.9532533333e+03 </v></row>

<row><v> 1.6988533333e+03 </v></row>

<row><v> 3.1576618000e+05 </v></row>

<row><v> 8.1693406667e+04 </v></row>

<row><v> 7.5631266667e+03 </v></row>

<row><v> 2.8765233333e+03 </v></row>

It seems like OpenNMS doesn't reset the count of exceeded thresholds. And this
occurs consistently.

My config:

OpenNMS Version: 1.2.8-1
Java Version: 1.4.2_12 Sun Microsystems Inc.
Java Virtual Machine: 1.4.2_12-b03 Sun Microsystems Inc.
Operating System: Linux 2.6.8-2-386 (i386)
Servlet Container: Apache Tomcat/4.1 (Servlet Spec 2.3)

Environment

Operating System: Linux Platform: PC URL: http://sourceforge.net/mailarchive/forum.php?thread_id=30459123&forum_id=40670

Acceptance / Success Criteria

None

Lucidchart Diagrams

Activity

Show:

Alejandro Galue December 8, 2010 at 11:10 AM

A minor change on ThresholdEvaluatorHighLow.java did the trick.

Now when a monitored value is rearmed before the threshold state reaches the trigger value, the counter will be reseted, and it will start again if the monitored value pass the configured threshold.

Alejandro Galue December 8, 2010 at 11:06 AM

Gabriel Gwartney, gave me a better picture of why this problem reappear again.

Here is his analysis:

I'm currently evaluating OpenNMS 1.8.4 and I noticed that there is
still a bug in thresholding that is similar to bug 1582. When
specifying trigger value N in a threshold, the trigger count does not
reset when the value falls below the threshold unless it has occurred N
times and a threshold rearm event has been generated. For example, we
gather an SNMP integer value from our devices in the field which
represent the current status of certain attributes. We evaluate whether
certain bits are set or not. If a certain combination of set bits
occurs N times in a row, we want an event generated. However, there are
instances where these certain bits were set for a brief period of time
(and the thresholding daemon increased the trigger count) but then
became unset. The thresholding daemon does not reset the trigger count
at this point. According to bug 1582, this behavior was fixed in 1.7.6
(at least that's the target milestone) but appears to still be present.
Is there a certain threshold configuration now that will create an event
when the threshold has been breached N times in a row?
Here is an example:

High threshold has been rearmed:
2010-10-15 14:02:21,742 DEBUG [CollectdScheduler-50 Pool-fiber0]
ThresholdEntity: evaluate: value= 0.0 against threshold:
{evaluator=high, dsName=(tgbGpsSyncFlags%2)+((tgbGpsSyncFlags/2)%2),
dsType=tgbNtpIndex,
evaluators=[{ds=(tgbGpsSyncFlags%2)+((tgbGpsSyncFlags/2)%2), value=2.0,
rearm=1.0, trigger=6}]}
2010-10-15 14:02:21,742 DEBUG [CollectdScheduler-50 Pool-fiber0]
ThresholdEvaluatorHighLow$ThresholdEvaluatorStateHighLow: evaluate: high
threshold rearmed

First trigger of high threshold:
2010-10-15 14:12:26,046 DEBUG [CollectdScheduler-50 Pool-fiber0]
ThresholdEntity: evaluate: value= 2.5 against threshold:
{evaluator=high, dsName=(tgbGpsSyncFlags%2)+((tgbGpsSyncFlags/2)%2),
dsType=tgbNtpIndex,
evaluators=[{ds=(tgbGpsSyncFlags%2)+((tgbGpsSyncFlags/2)%2), value=2.0,
rearm=1.0, trigger=6}]}
2010-10-15 14:12:26,046 DEBUG [CollectdScheduler-50 Pool-fiber0]
ThresholdEvaluatorHighLow$ThresholdEvaluatorStateHighLow: evaluate: high
threshold exceeded, count=1

Next, the value has fallen below the rearm level:
2010-10-15 14:17:27,115 DEBUG [CollectdScheduler-50 Pool-fiber0]
ThresholdEntity: evaluate: value= 0.0 against threshold:
{evaluator=high, dsName=(tgbGpsSyncFlags%2)+((tgbGpsSyncFlags/2)%2),
dsType=tgbNtpIndex,
evaluators=[{ds=(tgbGpsSyncFlags%2)+((tgbGpsSyncFlags/2)%2), value=2.0,
rearm=1.0, trigger=6}]}

Notice no reset of the trigger count.

Now, threshold has been breached once again.
2010-10-15 14:22:28,666 DEBUG [CollectdScheduler-50 Pool-fiber0]
ThresholdEntity: evaluate: value= 2.5 against threshold:
{evaluator=high, dsName=(tgbGpsSyncFlags%2)+((tgbGpsSyncFlags/2)%2),
dsType=tgbNtpIndex,
evaluators=[{ds=(tgbGpsSyncFlags%2)+((tgbGpsSyncFlags/2)%2), value=2.0,
rearm=1.0, trigger=6}]}
2010-10-15 14:22:28,666 DEBUG [CollectdScheduler-50 Pool-fiber0]
ThresholdEvaluatorHighLow$ThresholdEvaluatorStateHighLow: evaluate: high
threshold exceeded, count=2

You'll see the count still increments instead of starting at 1 again.
It will continue in this manner until count=6. Then it will fire off
the high threshold alarm. The next time the value falls below the
threshold, a rearm event is triggered and the count resets to 0. To me
these are false positives. If a value falls to the rearm level, the
count should be reset.

I hope this helps.

Alejandro Galue June 3, 2009 at 1:38 PM

Fixed on trunk using new in-line thresholding on pollerd (not checked with deprecated threshd).

PETER: If you still have problems with this, please reopen this bug.

iMacAlejo:opennms agalue$ svn log -l 1 opennms-services/src/test/java/org/opennms/netmgt/threshd/ThresholdingVisitorTest.java
------------------------------------------------------------------------ r13555 | agalue | 2009-06-03 13:04:31 -0430 (Wed, 03 Jun 2009) | 3 lines

Updating ThresholdingVisitorTest.testLatencyThresholdingSet to validate that has been fixed.
New test added ThresholdingVisitorTest.testThresholsFilters, just to be sure that resource filters work fine.
Minor correction on ThresholdingSet.
------------------------------------------------------------------------

Alejandro Galue June 2, 2009 at 2:14 PM

(In reply to comment #2)
> any chance to resolve this bug in ONMS 1.7.x? It's very annoying :-/
>
Hello Peter,

The current trunk version has in-line thresholds processing for pollerd. I will really appreciate if you can make tests against the current trunk.

In order to use the new thresholding processing engine, you must disable thresd (I mean, comment thresd entry on service-configuration.xml or remove any thresholder tag on thresd-configuration.xml).

And then only create a package on thresd that match your nodes and services; something like this:

<package name="myLatencyPackage">
<filter>IPADDR != '0.0.0.0'</filter>
<specific>10.10.10.10</specific>
<include-range begin="192.168.0.1" end="192.168.0.254"/>
<service name="ICMP" interval="300000" user-defined="false" status="on">
<parameter key="thresholding-group" value="icmp-snmp"/>
</service>
</package>

Then enable new in-line thresholding on pollerd-configuration.xml

Now pollerd will process thresholds like collectd.

The only difference between this and old threshd is that RRD will never be touched again (a boost in performance) and the compared value will be exactly the same value collected (gauge, not a rate).

Any issues please let me know in order to fix the new implementation.

In the meanwhile, I will try to add some junit tests against the trunk version to see if I can reproduce the problem.

Alejandro.

Peter Herzig May 8, 2009 at 12:32 PM

(In reply to comment #1)
> moving to 1.6.1 target milestone; only things left pending for 1.6.0 are
> blocker-level bugs

Hi,

any chance to resolve this bug in ONMS 1.7.x? It's very annoying :-/

Petr Herzig

Fixed

Details
Assignee
Alejandro Galue
Reporter
Peter Herzig
Components
Fix versions
1.8.7
1.9.4
Affects versions
1.2.8
Priority
Major

PagerDuty

Created September 11, 2006 at 2:33 AM

Updated January 27, 2017 at 4:32 PM

Resolved December 8, 2010 at 11:10 AM

icmp and http thresholds not working properly

Description

Environment

Acceptance / Success Criteria

Lucidchart Diagrams

Activity

Alejandro Galue December 8, 2010 at 11:10 AM

Alejandro Galue December 8, 2010 at 11:06 AM

Alejandro Galue June 3, 2009 at 1:38 PM

Alejandro Galue June 2, 2009 at 2:14 PM

Peter Herzig May 8, 2009 at 12:32 PM

DetailsAssigneeAlejandro GalueAlejandro GalueReporterPeter HerzigPeter HerzigComponentsFix versions1.8.71.9.4Affects versions1.2.8PriorityMajor

Details

Assignee

Reporter

Components

Fix versions

Affects versions

Priority

PagerDutyPagerDuty Incident

PagerDuty

Details
Assignee
Alejandro Galue
Reporter
Peter Herzig
Components
Fix versions
1.8.7
1.9.4
Affects versions
1.2.8
Priority
Major

PagerDuty