Detect and Attempt to Restart Failed Drools Engines

Description

Some exceptions can cause a Drools engine to stop working entirely, while the Correlator module remains "running". In this state, OpenNMS will not stop cleanly, and must be killed.

One such exception is java.util.ConcurrentModificationException:

Please improve the Correlator to be able to detect when an engine has failed, and attempt to restart it.
If the engine cannot be started, there should be a notification mechanism, and it should be possible to stop OpenNMS without resorting to "kill $(cat ${OPENNMS_HOME}/logs/opennms.pid)".

The current state also prevents cluster management software from identifying that part of the application has failed - "service opennms status" still says it's Running.
There should be some way to signal to a clustering tool that part of the application has failed and should be restarted.
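As a rough illustration of the status signal requested above, the sketch below aggregates per-engine state into a single exit code that a status script or clustering tool could poll. It is not OpenNMS code: the class `EngineHealth`, its methods, and the engine names are hypothetical, and the convention of 0 for healthy and 1 for failed is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of OpenNMS: tracks the state of each engine
// so that a single failed engine makes the overall status non-zero.
public class EngineHealth {
    public enum State { RUNNING, FAILED }

    private final Map<String, State> states = new ConcurrentHashMap<>();

    // Called by whatever component notices that an engine started or failed.
    public void report(String engineName, State state) {
        states.put(engineName, state);
    }

    // True only if every reported engine is RUNNING.
    public boolean allHealthy() {
        return states.values().stream().allMatch(s -> s == State.RUNNING);
    }

    // Assumed convention: 0 means healthy, 1 means at least one engine failed.
    public int exitCode() {
        return allHealthy() ? 0 : 1;
    }
}
```

With something along these lines, a status check could return a failure code as soon as one engine is marked FAILED, instead of reporting Running for the whole process.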

Acceptance / Success Criteria

None

Activity

Chandra Gorantla December 6, 2018 at 4:55 PM

Glad you figured out the issue.

Will Keaney December 5, 2018 at 7:33 PM

Scratch that - the HashMap was changed to a ConcurrentHashMap, so updating to Meridian 2018 should fix it. Waiting for 2018.1.3 to include the fix from this issue.

Will Keaney December 5, 2018 at 7:28 PM

I've opened issue DROOLS-3413 with the Drools project, as it seems like evaluateQueriesForRule is modifying a HashMap in a non-threadsafe manner, and this hasn't been fixed in 7.7.
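For readers unfamiliar with this exception, here is a minimal single-threaded sketch of the same failure type. It does not reproduce the Drools code path or evaluateQueriesForRule; the class and method names are made up for illustration. It only shows that structurally modifying a HashMap during iteration raises ConcurrentModificationException, while a ConcurrentHashMap tolerates the same pattern.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: not the Drools code, just the same exception type.
public class MapIterationDemo {
    // Returns true if adding a key while iterating throws
    // ConcurrentModificationException for the given map implementation.
    static boolean throwsCme(Map<String, Integer> map) {
        map.put("a", 1);
        map.put("b", 2);
        try {
            for (String key : map.keySet()) {
                // Structural modification while an iterator is active.
                map.put("added", 3);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("HashMap: " + throwsCme(new HashMap<>()));
        System.out.println("ConcurrentHashMap: " + throwsCme(new ConcurrentHashMap<>()));
    }
}
```

Running `main` prints `HashMap: true` and `ConcurrentHashMap: false`. HashMap iterators are fail-fast, whereas ConcurrentHashMap iterators are weakly consistent and never throw this exception.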

Chandra Gorantla December 4, 2018 at 9:31 PM

As we discussed on Mattermost, this fix creates a new thread that runs the engine again. The earlier thread, which caught the exception, ceases to exist.

This fix should help with the primary issue of OpenNMS getting stuck in an unknown state.
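The restart behaviour described above can be sketched roughly as follows. This is not the actual patch: `EngineSupervisor`, `engineLoop`, and `maxRestarts` are hypothetical names, and the bounded retry count is an assumption. The sketch only covers the case where the exception propagates out of the engine loop; it does nothing for a thread that stops responding without exiting.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical supervisor, not the OpenNMS implementation: when the engine
// loop throws, the failing thread starts a replacement and then ends.
public class EngineSupervisor {
    private final Runnable engineLoop;   // assumed stand-in for the engine's main loop
    private final int maxRestarts;       // assumed limit on restart attempts
    private final AtomicInteger restarts = new AtomicInteger();
    private volatile boolean failed;

    public EngineSupervisor(Runnable engineLoop, int maxRestarts) {
        this.engineLoop = engineLoop;
        this.maxRestarts = maxRestarts;
    }

    // Starts the engine loop on a fresh thread.
    public synchronized void start() {
        Thread t = new Thread(this::runOnce, "engine-run-" + restarts.get());
        t.setDaemon(true);
        t.start();
    }

    private void runOnce() {
        try {
            engineLoop.run();
        } catch (RuntimeException e) {
            if (restarts.incrementAndGet() <= maxRestarts) {
                start();        // a new thread takes over; this one returns and ends
            } else {
                failed = true;  // give up so callers can notify or shut down
            }
        }
    }

    public boolean hasFailed() { return failed; }

    public int restartCount() { return restarts.get(); }
}
```

Once `hasFailed()` returns true, a caller could send a notification and allow a clean shutdown rather than leaving the module reported as running.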

Will Keaney November 26, 2018 at 3:40 PM

@cgorantla, I opened a new case in RT that I think may be relevant: 5899. For that case, I also uploaded a zip file with thread dumps and a heap dump to the support dropbox.

I think this is relevant because the threads for the failed Drools engines still appear in the jstack thread dump, so I don't think they're exiting. They just stop responding.

Fixed

Details

Created September 20, 2018 at 8:54 PM
Updated December 6, 2018 at 4:55 PM
Resolved November 13, 2018 at 2:26 AM