Race condition on ALEC's config bundle after installation
Description
Activity
Stefan Wachter May 7, 2021 at 7:06 AM
merged into foundation-2019
Stefan Wachter April 28, 2021 at 9:12 AM (edited)
Result so far:
The `ConfigReloadContainer` contains a list of providers. The `SyslogMatchExtensionManager` is initially added to that list.
When the ALEC feature is installed, a couple of bundles get stopped first. In particular, the `api-layer` bundle is stopped. The `SyslogMatchExtensionManager` is part of that bundle and is therefore removed from the provider list of the `ConfigReloadContainer`.
When the `api-layer` bundle gets started again, the `SyslogMatchExtensionManager` is recreated. Two things now happen in parallel:
1. The `SyslogMatchExtensionManager` is eventually registered at the `ConfigReloadContainer` (via the `OnmsOSGIBridgeActivator`).
2. The `SyslogMatchExtensionManager.bind` method is called with the `SyslogMatchExtension` provided by the ALEC feature. This triggers a reload of the configuration (I am not completely sure of the execution sequence).
It can happen that the reload is done before the new `SyslogMatchExtensionManager` is registered with the `ConfigReloadContainer`.
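To make the ordering problem concrete, here is a minimal, deterministic sketch of it. It does not use the real OpenNMS classes: `Container` and `Provider` are hypothetical stand-ins for the `ConfigReloadContainer` and the `SyslogMatchExtensionManager`, and the string `alec-syslog-match` is a placeholder for the configuration the ALEC extension contributes. The two steps that run in parallel in the real system are simply executed in both possible orders.

```java
import java.util.ArrayList;
import java.util.List;

public class ReloadRaceSketch {
    /** Hypothetical stand-in for a provider such as SyslogMatchExtensionManager. */
    interface Provider {
        String provide();
    }

    /** Hypothetical stand-in for ConfigReloadContainer: a reload only sees providers registered so far. */
    static class Container {
        private final List<Provider> providers = new ArrayList<>();
        private List<String> loaded = new ArrayList<>();

        synchronized void register(Provider p) {
            providers.add(p);
        }

        synchronized void reload() {
            List<String> merged = new ArrayList<>();
            for (Provider p : providers) {
                merged.add(p.provide());
            }
            loaded = merged;
        }

        synchronized List<String> loaded() {
            return new ArrayList<>(loaded);
        }
    }

    /** Runs registration and reload in the given order and returns what ended up loaded. */
    static List<String> run(boolean registerFirst) {
        Container container = new Container();
        Provider manager = () -> "alec-syslog-match"; // placeholder payload
        if (registerFirst) {
            container.register(manager);
            container.reload();
        } else {
            container.reload();          // reload triggered by bind wins the race
            container.register(manager); // registration arrives too late
        }
        return container.loaded();
    }

    public static void main(String[] args) {
        System.out.println("register then reload: " + run(true));
        System.out.println("reload then register: " + run(false));
    }
}
```

In the second ordering the loaded configuration stays empty until something else triggers another reload, which would match the intermittent behavior described in this issue.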
I do not understand why the `api-layer` bundle gets stopped at all. Maybe this is the root cause of the problem. The alec-install.txt attachment shows what happens when the ALEC feature is installed.
I am not sure how to fix this. One possibility would be to somehow trigger the `DefaultEventConfDao.reload` method. That method already seems to be called on various occasions. Maybe the `ConfigReloadContainer` can trigger it indirectly, similar to how the `SyslogMatchExtensionManager` triggers a reload when its `bind` method is called.
PS: I tried whether adding `eventConfDao.reload()` to the `SyslogMatchExtensionManager.triggerReload` method would help, but it didn't. (Inspired by the `EventConfExtensionManager`, which does the same thing.)
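One way to picture the "container triggers the reload" idea is the sketch below. It is only an illustration under assumptions, not the change that was eventually merged: `Container`, `Dao`, `addListener`, and `snapshot` are hypothetical names, with `Dao.reload` standing in for something like `DefaultEventConfDao.reload`. The container notifies its listeners whenever a provider is registered, so a registration that arrives after an early reload still causes a second reload.

```java
import java.util.ArrayList;
import java.util.List;

public class ReloadOnRegisterSketch {
    /** Hypothetical stand-in for a provider such as SyslogMatchExtensionManager. */
    interface Provider {
        String provide();
    }

    /** Hypothetical container that notifies listeners after every registration. */
    static class Container {
        private final List<Provider> providers = new ArrayList<>();
        private final List<Runnable> listeners = new ArrayList<>();

        synchronized void addListener(Runnable listener) {
            listeners.add(listener);
        }

        void register(Provider p) {
            List<Runnable> toNotify;
            synchronized (this) {
                providers.add(p);
                toNotify = new ArrayList<>(listeners);
            }
            // Notify outside the lock so listeners can safely call back into snapshot().
            for (Runnable listener : toNotify) {
                listener.run();
            }
        }

        synchronized List<String> snapshot() {
            List<String> out = new ArrayList<>();
            for (Provider p : providers) {
                out.add(p.provide());
            }
            return out;
        }
    }

    /** Hypothetical stand-in for the DAO whose reload picks up the container's providers. */
    static class Dao {
        private volatile List<String> loaded = new ArrayList<>();

        void reload(Container container) {
            loaded = container.snapshot();
        }

        List<String> loaded() {
            return loaded;
        }
    }

    /** Replays the problematic order: reload first, registration afterwards. */
    static List<String> run() {
        Container container = new Container();
        Dao dao = new Dao();
        container.addListener(() -> dao.reload(container));
        dao.reload(container);                         // early reload sees no providers
        container.register(() -> "alec-syslog-match"); // late registration triggers another reload
        return dao.loaded();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With this wiring the final state no longer depends on which of the two steps runs first, at the cost of an extra reload per registration.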
Matthew Brooks June 22, 2020 at 10:34 PM
Ok thanks, makes sense.
Alejandro Galue June 22, 2020 at 9:21 PM
I'm sorry, I forgot to mention that.
Actually, if you follow what's described in the official docs to test ALEC, you'll see that sometimes the situation won't be created. When that happens, the following is what I use to check whether the configuration was loaded correctly:
When it doesn't work, that returns nothing, and when it works you should see:
Makes sense?
Details
Assignee: Stefan Wachter
Reporter: Alejandro Galue
Labels:
HB Grooming Date: Jul 07, 2020
HB Backlog Status: Backlog
Sprint: None
Priority: Major

Even before ALEC was named that way (when it was OCE), I found that the bundle called `ALEC :: Integrations :: OpenNMS :: Config` sometimes starts correctly and sometimes doesn't. I have observed this on all its versions, tested on pre-H24, H24, H25, and H26.
The behavior is random and hard to reproduce consistently, which is why I think it could be a race condition.
The only workaround I found is doing the following:
After that, all the custom event definitions for ALEC are loaded as expected, and the advanced correlation features work as intended. Without it, some events won't have a proper definition and because of that, the correlations won't trigger.