Race condition on ALEC's config bundle after installation

Description

Even before ALEC was named that way (when it was OCE), I found that the bundle called ALEC :: Integrations :: OpenNMS :: Config sometimes starts correctly, and sometimes doesn't; observed on all its version, tested on pre-H24, H24, H25, and H26.

The behavior is random, and hard to reproduce consistently, which is why I think it could be a race condition.

The only workaround I found is doing the following:

After that, all the custom event definitions for ALEC are loaded as expected, and the advanced correlation features work as intended. Without it, some events won't have a proper definition and because of that, the correlations won't trigger.

Acceptance / Success Criteria

None

Attachments

1

Lucidchart Diagrams

Activity

Show:

Stefan Wachter May 7, 2021 at 7:06 AM

merged into foundation-2019

Stefan Wachter April 30, 2021 at 7:02 AM

Stefan Wachter April 28, 2021 at 9:12 AM
Edited

Result so far:

  • The ConfigReloadContainer contains a list of providers. The SyslogMatchExtensionManager is initially added to that list.

  • When the ALEC feature is installed a couple of bundles get stopped first. In particular, the api-layer bundle is stopped. The SyslogMatchExtensionManager is part of that bundle and therefore is removed from the provider list of the ConfigReloadContainer

  • When the api-layer gets started again the SyslogMatchExtensionManager is recreated.

  • Two things happen in parallel now:
    1. The SyslogMatchExtensionManager is eventually registered at the ConfigReloadContainer (via the OnmsOSGIBridgeActivator)
    2. The SyslogMatchExtensionManager.bind method is called with the SyslogMatchExtension provided by the ALEC feature. This triggers a reload of the configuration (I am not completely sure of the execution sequence)

  • It can happen that the reload is done before the new SyslogMatchExtensionManager is registered with the ConfigReloadContainer.

I do not understand why the api-layer bundle gets stopped at all. Maybe this is the root cause of the problem. The alec-install.txt attachment shows what's going on when the ALEC feature is installed.

I am not sure how to fix this. One possibility would be to somehow trigger the DefaultEventConfDao.reload method. That method seems already being called on various occasions. Maybe the ConfigReloadContainer can trigger it indirectly similar to how the SyslogMatchExtensionManager triggers a reload when its bind method is called.

PS: I tried if adding `eventConfDao.reload()` to the `SyslogMatchExtensionManager.triggerReload` method might help - but it didn't. (Inspired by the EventConfExtensionManager that does the same thing.)

Matthew Brooks June 22, 2020 at 10:34 PM

Ok thanks, makes sense.

Alejandro Galue June 22, 2020 at 9:21 PM

I'm sorry, I forgot to mention that.

Actually, If I follow what's described in the official docs to test ALEC, you'll see that, sometimes, the situation won't be created. When this is the case, the following is what I use to see if the configuration was loaded correctly:

When it doesn't work, that returns nothing, and when it works you should see:

Makes sense?

Fixed

Details

Assignee

Reporter

Labels

HB Grooming Date

HB Backlog Status

Components

Sprint

Affects versions

Priority

PagerDuty

Created June 22, 2020 at 9:00 PM
Updated May 11, 2021 at 1:54 PM
Resolved May 7, 2021 at 7:06 AM