Improve resilience when writing flows to a malfunctioning ES cluster

Description

While writing flows to Elasticsearch, I noticed 100% CPU usage from OpenNMS, and flow processing stopped. In telemetryd.log I saw this error message:

In karaf.log I found this error message:

The high CPU usage was produced by Telemetryd and the flow handling, which can be seen here:

The Elasticsearch cluster state was green, and no problem stood out at first.

After further investigation, I found that Elasticsearch enforces a default hard limit of 1,000 shards per node. To restore flow writing, I had to update the cluster.max_shards_per_node setting manually.
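For reference, the manual workaround can be done through the cluster settings API. This is a sketch assuming a cluster reachable at localhost:9200 without authentication; adjust host, credentials, and the new limit for your deployment:

```shell
# Compare the number of open shards against the configured limit
# (host and port are assumptions; adapt to your cluster).
curl -s 'http://localhost:9200/_cluster/health?filter_path=active_shards'
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node'

# Raise the per-node shard limit persistently. This is a workaround only;
# the long-term fix is fewer shards (e.g. fewer indices or longer rollover
# periods for the flow indices).
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.max_shards_per_node": 2000}}'
```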

It seems that Elasticsearch returns an error in the bulk write response and OpenNMS fails to process that error correctly. This also appears to leave threads stuck in a loop, which causes the high CPU usage.
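The failure mode described above is possible because a bulk request can return HTTP 200 overall while individual documents are rejected; the client must check the top-level "errors" flag and inspect each item. A minimal sketch of that check (the payload below is hypothetical, modeled on the documented bulk response format, and is not taken from the actual logs of this incident):

```python
import json

# Hypothetical, truncated _bulk response as returned when the cluster-wide
# shard limit is hit: top-level "errors" flag plus per-item status/error.
bulk_response = json.loads("""
{
  "took": 3,
  "errors": true,
  "items": [
    {"index": {"_index": "netflow-2022.01", "status": 201}},
    {"index": {"_index": "netflow-2022.02", "status": 400,
               "error": {"type": "validation_exception",
                         "reason": "this action would add [2] total shards, but this cluster currently has [1000]/[1000] maximum shards open"}}}
  ]
}
""")

def failed_items(response):
    """Return the per-document failures from a bulk response.

    A bulk request can succeed at the HTTP level while individual
    documents fail, so callers must not assume success from the
    status code alone.
    """
    if not response.get("errors"):
        return []
    failures = []
    for item in response.get("items", []):
        # Each item is wrapped in its action name ("index", "create", ...).
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append(result["error"])
    return failures

errors = failed_items(bulk_response)
print(len(errors))        # 1
print(errors[0]["type"])  # validation_exception
```

A client that surfaces these per-item errors (and backs off instead of retrying immediately) would avoid the tight retry loop observed here.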

Acceptance / Success Criteria

None


Created January 11, 2022 at 10:43 AM
Updated January 18, 2022 at 7:56 PM