Currently, if the opennms-es-rest event forwarder loses connectivity to our Elasticsearch server, it stops sending HTTP requests to Elasticsearch and never tries to re-establish the connection until either OpenNMS is restarted or we re-install the feature from the Karaf console. While it is in this failed state, it continues to consume from Kafka and commit the offsets.
This can easily be reproduced by enabling a firewall on the OpenNMS server and blocking the outgoing Elasticsearch port. Once the opennms-es-rest forwarder has failed, stop the firewall and allow outgoing traffic to Elasticsearch; the forwarder will not recover.
It sounds like we also need to add configurable retries to the send operation so that transient outages don't result in dropped messages.
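A configurable retry for the send operation could look roughly like the sketch below. This is a minimal illustration of retry-with-delay around a send call; the `withRetries` helper and its parameter names are hypothetical, not part of the existing forwarder code.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper: retries an operation a configurable number of
// times with a fixed delay between attempts, so a transient Elasticsearch
// outage does not immediately drop the message. Names are illustrative only.
public class RetrySketch {

    static <T> T withRetries(Callable<T> op, int maxRetries, long delayMillis) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxRetries) {
                    Thread.sleep(delayMillis); // back off before retrying
                }
            }
        }
        throw last; // all attempts exhausted
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulate a transient outage: the send fails twice, then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("connection refused");
            }
            return "indexed";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the real forwarder the retry count and delay would presumably come from the feature's configuration rather than being hard-coded.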
As for round-robin sends to different Elasticsearch URIs, that doesn't appear to be a feature of the Jest library we use, so we'll have to write support for it ourselves. I'll open a separate issue for that.
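The distribution logic itself is simple; a sketch of a round-robin selector over a list of node URLs might look like this. This is illustrative only and does not use any Jest API; the class and the example URLs are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin selector over a list of Elasticsearch node URLs.
// Each call to next() returns the following URL in the list, wrapping around.
public class RoundRobinSketch {
    private final List<String> urls;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinSketch(List<String> urls) {
        this.urls = urls;
    }

    String next() {
        // floorMod keeps the index non-negative even if the counter overflows
        int i = Math.floorMod(counter.getAndIncrement(), urls.size());
        return urls.get(i);
    }

    public static void main(String[] args) {
        RoundRobinSketch rr = new RoundRobinSketch(
                List.of("http://es1:9200", "http://es2:9200", "http://es3:9200"));
        for (int n = 0; n < 4; n++) {
            System.out.println(rr.next());
        }
    }
}
```

An `AtomicInteger` keeps the selection thread-safe if multiple sender threads share the same selector.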
Tim Fite March 11, 2017 at 1:30 PM
Do you have an ETA on when this might be fixed?
Tim Fite March 11, 2017 at 12:38 PM
Yep, that is the exception we have been seeing. As a possible future enhancement, it might help if the elasticsearchUrl could take a comma delimited list of elasticsearch nodes that it could distribute the HTTP calls across.
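Parsing such a comma-delimited `elasticsearchUrl` value would be straightforward; a sketch, assuming the property name from the comment above and a hypothetical parsing helper:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical parser that splits a comma-delimited elasticsearchUrl setting
// into individual node URLs, trimming whitespace and dropping empty entries.
public class UrlListSketch {
    static List<String> parseUrls(String elasticsearchUrl) {
        return Arrays.stream(elasticsearchUrl.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(parseUrls("http://es1:9200, http://es2:9200"));
    }
}
```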
Seth Leger March 10, 2017 at 11:02 AM
It appears that a single exception is thrown and then processing stops: