Default GC and fullGC automations clash with alarm-driven status view in topo map

Description

Steps to illustrate problem:

1. Add a node with at least one interface and one service to the system
2. Add that node to a topo map and save the map
3. Switch the topo map's display mode to "Status" and observe that the node's status is "NodeUp"
4. Take the node down so that a nodeDown event and alarm are created
5. Refresh the topo map and observe that the node's status is now "NodeDown"
6. Wait at least eight days* without bringing the monitored node back online or disturbing the OpenNMS instance
7. View the topo map in "Status" mode and check the node's status

Expected result: node still shows a "NodeDown" status
Actual result: node now shows a "NodeUp" or possibly "SeeEventDetails" status

*Instead of waiting eight days, you could adjust the "fullGarbageCollect" action in vacuumd-configuration.xml to specify a shorter interval than '8 days'

The problem is that the "fullGarbageCollect" action in vacuumd-configuration.xml deletes every alarm in the system that has gone more than eight days since it last had an event reduced onto it ("lasteventtime") or was last affected by an automation ("lastautomationtime"). Likewise, the "garbageCollect" action deletes unacknowledged alarms that have gone untouched in the same way for at least three days. Both actions delete alarms with any of the four UEIs whose presence for a node the topo map uses to determine that node's status:

uei.opennms.org/nodes/nodeDown
uei.opennms.org/nodes/interfaceDown
uei.opennms.org/generic/traps/SNMP_Link_Down
uei.opennms.org/nodes/nodeLostService
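
To illustrate the interaction described above, here is a minimal Python sketch of the two rules as this report describes them. It is not the stock vacuumd SQL or the map code: the Alarm class, the function names, and the choice to measure idle time from the later of the two timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# The four UEIs listed in this report as driving node status on the map.
STATUS_UEIS = {
    "uei.opennms.org/nodes/nodeDown",
    "uei.opennms.org/nodes/interfaceDown",
    "uei.opennms.org/generic/traps/SNMP_Link_Down",
    "uei.opennms.org/nodes/nodeLostService",
}


@dataclass
class Alarm:
    """Hypothetical stand-in for an alarm row; not the real schema."""
    node_id: int
    uei: str
    last_event_time: datetime
    last_automation_time: Optional[datetime] = None
    acknowledged: bool = False


def last_touched(alarm: Alarm) -> datetime:
    # Assumption: idle time counts from whichever timestamp is later.
    times = [alarm.last_event_time]
    if alarm.last_automation_time is not None:
        times.append(alarm.last_automation_time)
    return max(times)


def survives_gc(alarm: Alarm, now: datetime) -> bool:
    idle = now - last_touched(alarm)
    if idle > timedelta(days=8):
        return False  # models "fullGarbageCollect": any alarm, acknowledged or not
    if not alarm.acknowledged and idle >= timedelta(days=3):
        return False  # models "garbageCollect": unacknowledged alarms only
    return True


def node_has_status_alarm(node_id: int, alarms: list) -> bool:
    # The map can only flag a node if one of the four alarm types still exists.
    return any(a.node_id == node_id and a.uei in STATUS_UEIS for a in alarms)


now = datetime(2011, 9, 15, 12, 0)
# Node 1 went down nine days ago and is still down; the alarm was acknowledged.
down = Alarm(1, "uei.opennms.org/nodes/nodeDown",
             now - timedelta(days=9), acknowledged=True)
remaining = [a for a in [down] if survives_gc(a, now)]
print(node_has_status_alarm(1, remaining))  # prints False
```

Under these assumptions, the nodeDown alarm is gone after nine days even though the node never came back, so nothing is left to tell the map that the node is down.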

Environment

Any system with a stock vacuumd-configuration.xml and map.properties

Acceptance / Success Criteria

None

Activity

David Hustace July 16, 2012 at 2:31 PM

Maps are a good training and diagnostic tool when they're done well, but they are a very blunt instrument for monitoring status when you consider status propagation, etc. What do we do, make a "New User" setting and a toggle for "Novice User"? I understand your point; however, if alarms are going to lie around for days and days and weeks and weeks, are they really "alarms"?

A good behavior for your use case would be to not delete any alarms that are acknowledged, and to have a cleanup automation that automatically acknowledges alarms instead of deleting them.

However, when you say new OpenNMS users, I would hesitate to have the standard configuration stop deleting old, stale-smelling alarms, because what we would probably end up with is worse than inaccurate maps: more likely a system with serious performance issues.

Dave Caplinger July 13, 2012 at 2:19 PM

Understood, but the default out-of-the-box situation is that maps will lie to you after either 3 or 8 days (depending on the situation as described above) and make things look good that are actually bad, unless you take some specific automation-customization action beforehand. New OpenNMS users will likely get burned by this before ever knowing it's something they would want to customize.

David Hustace July 13, 2012 at 11:52 AM

The point of making automations configurable is just that: you adjust them to your workflow. The defaults are based on a default NOC environment where alarms are used as a work item list. If there is a better default configuration, I'm certainly open to that.

Jeff Gehlbach September 16, 2011 at 7:18 AM

Antonio, I agree, going back to events would be a bad idea. The point of this bug is to correct the default automations so that a down node will not mysteriously appear on the map as "NodeUp" or "SeeEventDetails" after three or eight days!

It took me a while to understand it, but the way the map analyzes alarms makes sense!

Antonio Russo September 16, 2011 at 2:37 AM

Well, the maps were created with the idea of giving a visual snapshot of the network status. But the network status in OpenNMS is given by the alarms. So I use the active alarms on a node to set the "color" of the node on the map.
Changing back to "events" (as it was in the past, before alarms) would be really hard!

Details

Assignee: Jeff Gehlbach
Reporter: Jeff Gehlbach
Labels: RBsupport
Components: Example configuration files
Affects versions: 1.8.14, 1.9.91
Priority: Major

Created September 15, 2011 at 3:44 PM

Updated September 21, 2021 at 6:22 PM
