Alarm processing is very slow when Kafka producer is enabled and Kafka is unavailable

Description

When the Kafka producer is enabled but Kafka is unavailable, every attempt to push an alarm to the Kafka topic blocks for 1 minute by default.

The call in OpenNMSKafkaProducer:sendRecord() blocks on producer.send() when Kafka metadata cannot be obtained. The Kafka client's send() method waits for that metadata with a default timeout of 1 minute (see "max.block.ms" in http://kafka.apache.org/090/documentation.html).
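As a stopgap, the blocking window can be shortened by lowering "max.block.ms" in the producer configuration. The sketch below only shows how such an override might be built. The class and method names are hypothetical, and how these properties actually reach the producer inside OpenNMS is an assumption not covered by this issue. Note that this shortens the stall per alarm but does not remove it.

```java
import java.util.Properties;

// Hypothetical helper: not part of OpenNMS or the Kafka client.
class ProducerTimeoutConfig {

    // Returns a copy of the given producer properties with "max.block.ms"
    // (a standard Kafka producer setting, default 60000 ms) overridden.
    static Properties withMaxBlockMs(Properties base, long maxBlockMs) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty("max.block.ms", Long.toString(maxBlockMs));
        return props;
    }
}
```

For example, `ProducerTimeoutConfig.withMaxBlockMs(existingProps, 5000)` would cap each blocked send() at roughly 5 seconds instead of 1 minute.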

I reproduced this issue with a misconfigured "ADVERTISED_HOST" environment variable set for my Kafka container. I suspect there are other ways to reproduce it; simply stopping Kafka may have the same result.

The alarms are eventually processed, but serially, each one only after its own 1 minute wait.

One potential fix would be to change sendRecord so that it pushes each record onto a bounded queue, with a separate thread sending records from that queue to Kafka, so that the OpenNMS alarmd thread is never blocked.
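A minimal sketch of that idea, using only the JDK, might look like the following. It is not the actual OpenNMS fix: the class name AsyncRecordForwarder, the Consumer standing in for the blocking producer.send() call, and the choice to drop records when the queue is full are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch: decouples the calling thread from a sender that may block.
class AsyncRecordForwarder<T> implements AutoCloseable {
    private final BlockingQueue<T> queue;
    private final Thread worker;

    // blockingSender stands in for the call that may block (e.g. a Kafka send).
    AsyncRecordForwarder(int capacity, Consumer<T> blockingSender) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.worker = new Thread(() -> {
            try {
                while (true) {
                    T record = queue.take(); // waits until a record is available
                    try {
                        blockingSender.accept(record); // only this thread can stall here
                    } catch (RuntimeException e) {
                        // Keep the worker alive; a real implementation would log this.
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit on shutdown
            }
        }, "record-forwarder");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    // Never blocks: returns false if the queue is full and the record was dropped.
    boolean enqueue(T record) {
        return queue.offer(record);
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

With this shape, sendRecord would call enqueue() and return immediately. Dropping on a full queue is one possible policy; whether dropped alarms are acceptable, or should instead be counted, logged, or resynchronized later, is a design decision this sketch leaves open.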

Acceptance / Success Criteria

None

Activity

Matthew Brooks October 16, 2018 at 2:54 PM

Matthew Brooks October 11, 2018 at 4:13 PM

This issue also seems to block other alarm handling. For example, clearing an alarm from the web UI also takes more than 1 minute when the Kafka producer is enabled.

Matthew Brooks October 11, 2018 at 4:12 PM

With Kafka forwarding enabled, I was attempting to create an alarm via the event API. The alarm would eventually show up in the OpenNMS UI, but only after the Kafka producer timed out attempting to send the alarm.

Fixed

Created September 27, 2018 at 8:35 PM
Updated October 17, 2018 at 1:24 PM
Resolved October 17, 2018 at 1:24 PM