Multiple OpenNMS features stop working when the Events Forwarder cannot push content to Elasticsearch

Description

A customer opened multiple support tickets for apparently unrelated problems affecting one of their environments.

As soon as the ES Forwarder problem was solved (as reported in ), all the other problems magically resolved themselves.

I don't think this is just a coincidence: whenever the ES Forwarder was unable to push content to Elasticsearch, regardless of the reason, it affected multiple OpenNMS features at once.

Among the problems the customer observed and reported while the ES Forwarder was blocked were:

1) The Poller was unable to complete all polling cycles.

2) The Collector was unable to complete all collection cycles.

3) There were random, recurring gaps in the graphs for all collected metrics, even though the ScyllaDB cluster was working as fast and reliably as usual.

4) All the minions (no exceptions) constantly reported multiple short outages for the Heartbeat and RPC services.

5) There was a huge lag in the Syslog Sink topic, preventing OpenNMS from consuming the messages sent by the minions through Kafka.

6) The metrics injection rate into Newts fluctuated (besides the interruptions mentioned in [3]), falling outside the steady rate of roughly 4K samples per second that this customer sees under normal conditions.

There may have been more, but the above demonstrates that whatever state the ES Forwarder entered prevented other, unrelated features from working properly. There should be defensive code in place to keep this from happening in the future.
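As a sketch of the kind of defensive decoupling argued for above (none of these class or method names come from the OpenNMS codebase; this is a hypothetical illustration, not the actual fix), the forwarder could accept events through a bounded, non-blocking queue so that a slow or unreachable Elasticsearch never back-pressures the event bus shared by the Poller, Collector, and Sink consumers. When the queue fills, events are dropped and counted instead of blocking the caller:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: isolate a slow/unreachable sink behind a bounded
// queue that drops on overflow rather than blocking event producers.
public class BoundedForwarder {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    public BoundedForwarder(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks: if the sink cannot keep up, the event is dropped. */
    public boolean offer(String event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            // Surface drops as a metric/alarm instead of stalling callers.
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Consumed by the forwarder's own worker thread, not the event bus. */
    public String poll() {
        return queue.poll();
    }

    public long droppedCount() {
        return dropped.get();
    }
}
```

Losing forwarded events under overload is undesirable, but it is a far smaller failure than the cascade described above, and the drop counter makes the degraded state visible.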

Acceptance / Success Criteria

None

Activity


Jeff Gehlbach February 7, 2023 at 2:15 PM

Reassigning fixVersion as 32.0.0 since the PR for this issue targeted develop well after release-31.x and foundation-2023 were created.

Spawning a separate issue to evaluate back-porting.

Jeff Gehlbach January 10, 2023 at 7:51 PM

I’m re-targeting this issue to Horizon 31.0.4 so that we can proceed with the 31.0.3 release.

The PR needs to be re-targeted to foundation-2023.

Dmitri Herdt December 13, 2022 at 2:26 PM

PR:

Jeff Gehlbach June 2, 2022 at 8:17 PM

We're seeing this scenario play out on an internal production system (thanks to the ever-alert ), and I've seen it on end-user systems too. Updating affected versions and adding to backlog.

Alejandro Galue March 12, 2021 at 7:31 PM

Since a similar situation (in terms of symptoms) happened recently, and the solution was restarting OpenNMS, the customer has decided to disable the Elasticsearch forwarders for events and alarms and keep the Kafka Producer for their Elasticsearch integration.

However, if improvements can be made to the Elasticsearch forwarders in the future, so that this customer can reconsider using them and potential new users can avoid these issues, I believe it is worth investing time in them.

Fixed

Details

Created November 24, 2020 at 10:36 PM
Updated February 7, 2023 at 2:22 PM
Resolved February 2, 2023 at 6:32 PM