Dashboard error caused by the alarms dashlet
Description
Environment
Acceptance / Success Criteria
Attachments
Lucidchart Diagrams
Activity

Ronny Trommer March 23, 2016 at 12:49 PM
Solved with .

peppi December 5, 2012 at 3:17 AM
I have had the same error since updating from version 1.10.1 to 1.10.7.
Did you resolve the problem?

Bruno Brouckaert December 23, 2011 at 8:43 AM
After investigating the problem further, I was able to stop the bug from happening. I read up on alarm configuration and found the root of the problem (at least I think I did) in the auto-clean functionality of alarms.
The configuration for uei.opennms.org/generic/traps/SNMP_Authen_Failure was the following (I assume it is default):
<event>
  <mask>
    <maskelement>
      <mename>generic</mename>
      <mevalue>4</mevalue>
    </maskelement>
  </mask>
  <uei>uei.opennms.org/generic/traps/SNMP_Authen_Failure</uei>
  ...
  <logmsg dest="logndisplay">Incorrect Community Name (authenticationFailure Trap) enterprise:%id% (%id%) args(%parm##%):%parm[all]%</logmsg>
  <severity>Warning</severity>
  <alarm-data reduction-key="%uei%:%dpname%:%nodeid%"
              alarm-type="3" auto-clean="true"/>
</event>
After changing it to the following config (notice the dest and auto-clean attributes), the exception stopped popping up:
<event>
  <mask>
    <maskelement>
      <mename>generic</mename>
      <mevalue>4</mevalue>
    </maskelement>
  </mask>
  <uei>uei.opennms.org/generic/traps/SNMP_Authen_Failure</uei>
  <event-label>OpenNMS-defined trap event: SNMP_Authen_Failure</event-label>
  <descr> <p>An authentication failure trap signifies that the sending protocol entity is the addressee of a protocol message that is not proper$
  <logmsg dest="discardtraps">Incorrect Community Name (authenticationFailure Trap) enterprise:%id% (%id%) args(%parm##%):%parm[all]%</logmsg>
  <severity>Warning</severity>
  <alarm-data reduction-key="%uei%:%dpname%:%nodeid%"
              alarm-type="3" auto-clean="false"/>
</event>
Now I don't know why I seem to be the only one getting this error. Maybe my config wasn't the default, or maybe the 8 community strings we try on each node generated too many traps (eventid 25.282.511 and counting after 3 months). In any case, this functionality seems to create race conditions on systems with heavily loaded databases and rapidly changing alarms.
I am going to leave this bug as unresolved, as this seems like unwanted behavior, but at least now I have a workaround. Feel free to close the issue if you think no further action is needed.
The bug is a dashboard error that occasionally pops up. Whenever it does, no alarms are loaded into the alarms dashlet, so I am mainly investigating the alarms table. When the exception is thrown, the enclosed stack trace is written to the uncategorized.log file.
The frequency of the exception seems directly correlated with the size of my events table. When I started on this project, the table had grown to 4.5 million events in just 3 months, mostly due to a couple of traps that were spammed every few seconds. After working the table down to 800k events, the exception occurs less frequently.
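For anyone hitting the same problem, a query like this can show which event types dominate the table (a sketch against the stock OpenNMS PostgreSQL schema; I am assuming the events table exposes the UEI in an eventuei column, as mine does):

```sql
-- Top 10 event types by volume; useful for spotting spammed traps.
-- Assumes the standard OpenNMS schema (events table, eventuei column).
SELECT eventuei, count(*) AS cnt
FROM events
GROUP BY eventuei
ORDER BY cnt DESC
LIMIT 10;
```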
I have been monitoring the alarms table and I noticed a few alarms that update their lasteventid field about every 3 seconds. Here is the trap information:
UEI: uei.opennms.org/generic/traps/SNMP_Authen_Failure
Logmsg: Incorrect Community Name (authenticationFailure Trap) enterprise:.1.3.6.1.6.3.1.1.5 (.1.3.6.1.6.3.1.1.5) args(3):.1.3.6.1.4.1.9.2.1.5.0="<IP ONMS server>" .1.3.6.1.4.1.9.9.412.1.1.1.0="1" .1.3.6.1.4.1.9.9.412.1.1.2.0="IP ONMS server"
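The rapidly updating alarms themselves can be spotted with something like the following (again a sketch; I am assuming the counter and lasteventtime columns on the alarms table from the stock schema):

```sql
-- Alarms reduced most recently, with how often they have fired.
-- Assumes the standard OpenNMS alarms table layout.
SELECT alarmid, eventuei, counter, lasteventid, lasteventtime
FROM alarms
ORDER BY lasteventtime DESC
LIMIT 10;
```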
When I copy the lasteventid and look it up in the events table, I don't find any results. I thought the id might point to an event that no longer exists, but the following SQL statement returns zero rows:
select alarmid, logmsg, lasteventid from alarms where lasteventid not in (select eventid from events);
 alarmid | logmsg | lasteventid
---------+--------+-------------
(0 rows)
So my theory is that the last event is removed between the time the alarms are read and the time the event is looked up. I don't know why the events are removed, as I didn't configure OpenNMS to do so.
I realize I shouldn't be receiving the same trap every other second and I will troubleshoot this, but I would be grateful for any other insight you can give me into this issue.