Alarm processing is very slow when Kafka producer is enabled and Kafka is unavailable
Description
When the Kafka producer is enabled but Kafka is unavailable, every attempt to push an alarm to the Kafka topic blocks for 1 minute by default.
The call in OpenNMSKafkaProducer:sendRecord() ends up blocking on producer.send() if Kafka metadata cannot be obtained. The block occurs because the Kafka client's send() method first tries to fetch metadata and waits up to "max.block.ms", which defaults to 1 minute (see http://kafka.apache.org/090/documentation.html "max.block.ms").
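As a stopgap, the wait can be shortened by lowering "max.block.ms" in the producer configuration. The sketch below only builds the properties; the class name, method name, broker address, and the 5 second value are assumptions for illustration, and how these properties reach the producer inside OpenNMS is not shown here. This shortens each stall but does not remove it.

```java
import java.util.Properties;

// Hypothetical helper, not OpenNMS code: builds Kafka producer properties
// with a reduced metadata wait.
public class ProducerTimeoutSketch {

    public static Properties producerProps(String bootstrapServers, long maxBlockMs) {
        Properties props = new Properties();
        // Address of the Kafka broker(s); the value passed in is an assumption.
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Upper bound on how long send() may block, e.g. while waiting for metadata.
        // The Kafka default is 60000 ms (1 minute).
        props.setProperty("max.block.ms", Long.toString(maxBlockMs));
        return props;
    }

    public static void main(String[] args) {
        Properties props = producerProps("localhost:9092", 5000L);
        System.out.println("max.block.ms=" + props.getProperty("max.block.ms"));
    }
}
```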
I reproduced this issue with a misconfigured "ADVERTISED_HOST" environment variable set for my Kafka container. I suspect there are other ways of reproducing it; simply stopping Kafka may have the same result.
The alarms do eventually get processed, but serially, each one only after waiting out the full 1 minute timeout.
One potential fix would be to change the call to sendRecord so that it pushes the record onto a bounded queue, with a separate thread sending records from that queue to Kafka, so the OpenNMS alarmd thread is never blocked.
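A minimal sketch of that idea follows. BoundedForwarder is a hypothetical class, not existing OpenNMS code; the sender callback stands in for whatever ends up calling producer.send(), and dropping records when the queue is full is one possible policy assumed here, not a decided design.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical helper: decouples callers from a sender that may block.
public class BoundedForwarder<T> implements AutoCloseable {
    private final BlockingQueue<T> queue;
    private final Thread worker;

    public BoundedForwarder(int capacity, Consumer<T> sender) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.worker = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // Only this thread waits on the (possibly slow) sender.
                    T record = queue.take();
                    try {
                        sender.accept(record);
                    } catch (RuntimeException e) {
                        // A failed send must not kill the worker; real code would log here.
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "kafka-forwarder");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    // Never blocks the caller: returns false if the queue is full and the record was dropped.
    public boolean enqueue(T record) {
        return queue.offer(record);
    }

    // Number of records waiting to be sent.
    public int pending() {
        return queue.size();
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

With this shape, the alarm-handling thread would call enqueue() and return immediately, while the worker thread absorbs the timeout. The queue capacity bounds memory use if Kafka stays down for a long time.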
This issue also seems to block other alarm handling. For example, clearing an alarm from the web UI also takes more than 1 minute to execute if the Kafka producer is enabled.
Matthew Brooks October 11, 2018 at 4:12 PM
With Kafka forwarding enabled, I was attempting to create an alarm via the event API. The alarm would eventually show up in the OpenNMS UI, but only after the Kafka producer timed out attempting to send the alarm.