Default GC and fullGC automations clash with alarm-driven status view in topo map
Activity
David Hustace July 16, 2012 at 2:31 PM
Maps are a good training and diagnostic tool when they're done well, but they are a very blunt instrument for monitoring status when you consider status propagation, etc. What do we do, make a "New User" setting and a toggle for "Novice User"? I understand your point; however, if alarms are going to lie around for days and days and weeks and weeks, then are they really "alarms"?
A good behavior for your use case would be to not delete any alarms that are acknowledged, and to have a cleanup automation that automatically acknowledges alarms instead of deleting them.
However, when you say new OpenNMS users, I would hesitate to have the standard configuration stop deleting old, stale-smelling alarms, because what we would probably end up with is worse than inaccurate maps: more likely a system with serious performance issues.
Dave Caplinger July 13, 2012 at 2:19 PM
Understood, but the default out-of-the-box situation is that maps will lie to you after either 3 or 8 days (depending on the situation as described above) and make things look good that are actually bad, unless you take some specific automation-customization action beforehand. New OpenNMS users will likely get burned by this before ever knowing it's something they would want to customize.
David Hustace July 13, 2012 at 11:52 AM
The point of making automations configurable is just that: you adjust them to your workflow. The defaults are based on a default NOC environment where alarms are used as a work item list. If there is a better default configuration, I'm certainly open to that.
Jeff Gehlbach September 16, 2011 at 7:18 AM
Antonio, I agree, going back to events would be a bad idea. The point of this bug is to correct the default automations so that a down node will not mysteriously appear on the map as "NodeUp" or "SeeEventDetails" after three or eight days!
It took me a while to understand it, but the way the map analyzes alarms makes sense!
Antonio Russo September 16, 2011 at 2:37 AM
Well, the maps were created with the idea of giving a visual snapshot of the network status. But the network status in OpenNMS is given by the alarms, so I use the active alarms on a node to set the "color" of the node on the map.
Changing to "events" (as it was in the past, before alarms) is really hard!
Steps to illustrate problem:
1. Add a node with at least one interface and one service to the system
2. Add that node to a topo map and save the map
3. Switch the topo map's display mode to "Status" and observe that the node's status is "NodeUp"
4. Take the node down so that a nodeDown event and alarm are created
5. Refresh the topo map and observe that the node's status is now "NodeDown"
6. Wait at least eight days* without bringing the monitored node back online or disturbing the OpenNMS instance
7. View the topo map in "Status" mode and check the node's status
Expected result: node still shows a "NodeDown" status
Actual result: node now shows a "NodeUp" or possibly "SeeEventDetails" status
*Instead of waiting eight days, you could adjust the "fullGarbageCollect" action in vacuumd-configuration.xml to specify a shorter interval than '8 days'
The problem is that the "fullGarbageCollect" action in vacuumd-configuration.xml will delete every alarm in the system that has sat for more than eight days since it last had an event reduced onto it ("lasteventtime") or was affected by an automation ("lastautomationtime"). Likewise, the "garbageCollect" action will do the same for unacknowledged alarms that have sat in the same way for at least three days. This includes alarms with the four UEIs whose presence for a node the topo map uses to determine that node's status:
uei.opennms.org/nodes/nodeDown
uei.opennms.org/nodes/interfaceDown
uei.opennms.org/generic/traps/SNMP_Link_Down
uei.opennms.org/nodes/nodeLostService
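The interaction described above can be reproduced in miniature. The following is only a sketch, not the actual vacuumd SQL: it uses an in-memory SQLite table in place of the real database, the "nodeid" and "alarmacktime" columns and the reap() and map_status() helpers are assumptions made for illustration, and the map's status logic is reduced to "any alarm with one of the four UEIs means NodeDown".

```python
import sqlite3
from datetime import datetime, timedelta

# The four UEIs the topo map looks for (taken from the list above).
STATUS_UEIS = (
    "uei.opennms.org/nodes/nodeDown",
    "uei.opennms.org/nodes/interfaceDown",
    "uei.opennms.org/generic/traps/SNMP_Link_Down",
    "uei.opennms.org/nodes/nodeLostService",
)


def reap(conn, now, max_age_days, unacked_only):
    """Hypothetical stand-in for garbageCollect / fullGarbageCollect.

    Deletes alarms whose last event and last automation touch are both
    older than max_age_days. Returns the number of rows deleted.
    """
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    sql = (
        "DELETE FROM alarms "
        "WHERE lasteventtime < ? "
        "AND COALESCE(lastautomationtime, lasteventtime) < ?"
    )
    if unacked_only:
        # Assumed column: NULL means the alarm was never acknowledged.
        sql += " AND alarmacktime IS NULL"
    return conn.execute(sql, (cutoff, cutoff)).rowcount


def map_status(conn, nodeid):
    """Simplified map logic: a matching alarm means the node is down."""
    marks = ", ".join("?" for _ in STATUS_UEIS)
    row = conn.execute(
        f"SELECT COUNT(*) FROM alarms WHERE nodeid = ? AND eventuei IN ({marks})",
        (nodeid, *STATUS_UEIS),
    ).fetchone()
    return "NodeDown" if row[0] else "NodeUp"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alarms (nodeid INTEGER, eventuei TEXT, alarmacktime TEXT, "
    "lasteventtime TEXT, lastautomationtime TEXT)"
)

# A node went down four days ago; nobody acknowledged the alarm.
now = datetime(2011, 9, 16, 12, 0, 0)
down_since = (now - timedelta(days=4)).isoformat()
conn.execute(
    "INSERT INTO alarms VALUES (1, ?, NULL, ?, NULL)",
    (STATUS_UEIS[0], down_since),
)

before = map_status(conn, 1)
deleted = reap(conn, now, max_age_days=3, unacked_only=True)  # "garbageCollect"
after = map_status(conn, 1)
print(before, deleted, after)
```

In this toy model the node is still down, but once the three-day reaper has run the map has no alarm left to base its status on and falls back to "NodeUp".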
Suggested solutions:
1. Create a new automation that bumps the "lastautomationtime" of alarms with any of those four UEIs every day or two, so that they never gather three days' "dust"
2. Add a NOT IN constraint on the "eventuei" column to the existing "garbageCollect" and "fullGarbageCollect" automations so that alarms with those UEIs will never get reaped
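Both suggestions can be sketched against a toy alarms table. This is an illustration under stated assumptions, not the real vacuumd configuration: SQLite stands in for the production database, the function names touch_status_alarms() and reap_sparing_status_alarms() are hypothetical, and "uei.example.org/hypothetical/other" is a made-up UEI representing any alarm the map does not care about.

```python
import sqlite3
from datetime import datetime, timedelta

# The four UEIs the topo map uses to determine node status.
STATUS_UEIS = (
    "uei.opennms.org/nodes/nodeDown",
    "uei.opennms.org/nodes/interfaceDown",
    "uei.opennms.org/generic/traps/SNMP_Link_Down",
    "uei.opennms.org/nodes/nodeLostService",
)
MARKS = ", ".join("?" for _ in STATUS_UEIS)


def touch_status_alarms(conn, now):
    """Suggestion 1: bump lastautomationtime so status alarms never look stale."""
    sql = f"UPDATE alarms SET lastautomationtime = ? WHERE eventuei IN ({MARKS})"
    return conn.execute(sql, (now.isoformat(), *STATUS_UEIS)).rowcount


def reap_sparing_status_alarms(conn, now, max_age_days):
    """Suggestion 2: reap stale alarms, but never those with a status UEI."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    sql = (
        "DELETE FROM alarms "
        "WHERE lasteventtime < ? "
        "AND COALESCE(lastautomationtime, lasteventtime) < ? "
        f"AND eventuei NOT IN ({MARKS})"
    )
    return conn.execute(sql, (cutoff, cutoff, *STATUS_UEIS)).rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alarms (eventuei TEXT, lasteventtime TEXT, lastautomationtime TEXT)"
)

# Two alarms, both untouched for nine days.
now = datetime(2011, 9, 16, 12, 0, 0)
stale = (now - timedelta(days=9)).isoformat()
conn.executemany(
    "INSERT INTO alarms VALUES (?, ?, NULL)",
    [
        ("uei.opennms.org/nodes/nodeDown", stale),
        ("uei.example.org/hypothetical/other", stale),  # made-up UEI
    ],
)

deleted = reap_sparing_status_alarms(conn, now, max_age_days=8)
remaining = [row[0] for row in conn.execute("SELECT eventuei FROM alarms")]
touched = touch_status_alarms(conn, now)
print(deleted, remaining, touched)
```

The NOT IN variant keeps the reaper's schedule unchanged and only narrows what it deletes, while the "touch" variant leaves the existing automations alone and adds a new one. Either way, status alarms for nodes that never come back would accumulate until someone clears them, which is the performance concern raised in the comments.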
This issue was created in conjunction with https://mynms.opennms.com/Ticket/Display.html?id=636