Improve resilience when writing flows to a malfunctioning ES cluster

Description

While writing flows to Elasticsearch, I noticed 100% CPU usage from OpenNMS, and flow processing stopped. In telemetryd.log I saw this error message:

In karaf.log I found this error message:

The high CPU usage was produced by Telemetryd and the flow handling, which can be seen here:

The Elasticsearch cluster state was green, and no problem stood out at first.

After further investigation, I found that Elasticsearch enforces a default hard limit of 1,000 shards per node. To restore flow writing, I had to update the cluster.max_shards_per_node setting manually.
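For reference, the manual workaround can be done through the cluster settings API. This is a sketch assuming a cluster reachable at localhost:9200 without authentication; adjust host, credentials, and the new limit for your deployment:

```shell
# Compare the number of open shards against the configured limit
# (host and port are assumptions; adapt to your cluster).
curl -s 'http://localhost:9200/_cluster/health?filter_path=active_shards'
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node'

# Raise the per-node shard limit persistently. This is a workaround only;
# the long-term fix is fewer shards (e.g. fewer indices or longer rollover
# periods for the flow indices).
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.max_shards_per_node": 2000}}'
```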

It seems that Elasticsearch returns an error in the bulk write response and OpenNMS fails to process that error correctly. This also appears to leave threads stuck in a loop, which causes the high CPU usage.
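The failure mode described above is possible because a bulk request can return HTTP 200 overall while individual documents are rejected; the client must check the top-level "errors" flag and inspect each item. A minimal sketch of that check (the payload below is hypothetical, modeled on the documented bulk response format, and is not taken from the actual logs of this incident):

```python
import json

# Hypothetical, truncated _bulk response as returned when the cluster-wide
# shard limit is hit: top-level "errors" flag plus per-item status/error.
bulk_response = json.loads("""
{
  "took": 3,
  "errors": true,
  "items": [
    {"index": {"_index": "netflow-2022.01", "status": 201}},
    {"index": {"_index": "netflow-2022.02", "status": 400,
               "error": {"type": "validation_exception",
                         "reason": "this action would add [2] total shards, but this cluster currently has [1000]/[1000] maximum shards open"}}}
  ]
}
""")

def failed_items(response):
    """Return the per-document failures from a bulk response.

    A bulk request can succeed at the HTTP level while individual
    documents fail, so callers must not assume success from the
    status code alone.
    """
    if not response.get("errors"):
        return []
    failures = []
    for item in response.get("items", []):
        # Each item is wrapped in its action name ("index", "create", ...).
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append(result["error"])
    return failures

errors = failed_items(bulk_response)
print(len(errors))        # 1
print(errors[0]["type"])  # validation_exception
```

A client that surfaces these per-item errors (and backs off instead of retrying immediately) would avoid the tight retry loop observed here.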

Acceptance / Success Criteria

None


Created January 11, 2022 at 10:43 AM
Updated January 18, 2022 at 7:56 PM