Minion doesn't recover when Kafka becomes available

Description

A customer is interested in using Minion's off-heap feature to keep messages in memory when the Kafka cluster is unavailable or the Minion has network issues, and then forward them to the cluster once the Minion can reach it again.

I'm attaching a file called docker-test.tar.gz. It includes a docker-compose.yaml with a ZooKeeper cluster (3 instances), a Kafka cluster (5 instances, to emulate what the customer has), a PostgreSQL instance, an OpenNMS instance, and a Minion instance, so make sure the machine running Docker has enough resources to start it. All the configuration needed to use Kafka for RPC/Sink and to enable the off-heap feature is provided via overlay.
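For context, a minimal sketch of what the relevant pieces of that overlay might look like is below. The file names, directory, and property names are assumptions drawn from memory of the OpenNMS Kafka/Sink documentation, so the attached archive is the authoritative reference.

    # Point the Minion's Sink API at the Kafka cluster (file name and property
    # names are assumptions; check the attached overlay for the real values).
    cat > minion-overlay/org.opennms.core.ipc.sink.kafka.cfg <<'EOF'
    bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092,kafka4:9092,kafka5:9092
    EOF

    # Enable the off-heap queue so messages are buffered while Kafka is down.
    cat > minion-overlay/org.opennms.core.ipc.sink.offheap.cfg <<'EOF'
    entriesAllowedOnHeap=100000
    offHeapSize=1GB
    EOF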

Once the whole solution was working, I used the udpgen tool to send traps to the Minion (one every second). With trapd.log in DEBUG mode I could see that OpenNMS was receiving the traps. Because the test trap is not recognized by OpenNMS, it falls under the Enterprise Default definition, which keeps only the last event in the database and updates an alarm. The alarm count will therefore reflect how many traps have been sent, and that number can be correlated with the count udpgen reports sending.
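The traps were generated with udpgen along these lines; the exact flags are taken from memory of the udpgen README and may differ by version (1162 is the Minion's default trap listener port).

    # Send one SNMP trap per second to the Minion's trap listener.
    # Flag names, the snmp payload selector, and the port are assumptions.
    ./udpgen -x snmp -h <minion-ip> -p 1162 -r 1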

To test the off-heap feature, I stopped the Kafka instances one by one with docker-compose stop while tailing trapd.log in OpenNMS. Once all the brokers are stopped there is no activity, as expected. I waited a few seconds and then started the brokers one by one until the whole cluster was active again. Every instance has a healthcheck, so docker-compose ps shows whether they are running properly.
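The outage itself was just plain docker-compose commands; the broker service names below are illustrative, not necessarily the ones used in the attached compose file.

    # Stop the brokers one at a time until the whole cluster is down.
    for b in kafka1 kafka2 kafka3 kafka4 kafka5; do docker-compose stop "$b"; done

    # ...wait, then bring them back one at a time and let them rejoin.
    for b in kafka1 kafka2 kafka3 kafka4 kafka5; do docker-compose start "$b"; done

    # Confirm every container's healthcheck reports it as healthy again.
    docker-compose ps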

Once the cluster was available again I saw a burst of messages come in, which might be the result of flushing the off-heap content. Then OpenNMS stopped receiving traps, even though the udpgen tool was never stopped. While investigating, I found that karaf.log on the Minion was being flooded with messages like this:

It looks like the Minion is not able to recover properly after the whole Kafka cluster has been unavailable and then brought back online.

Acceptance / Success Criteria

None

Attachments

docker-test.tar.gz

Activity

Alejandro Galue January 30, 2020 at 7:19 PM

I've tried multiple times to reproduce the problem, but the solution seems to work as expected. The Kafka client is able to recover and reconnect to the cluster after a full cluster outage.

Alejandro Galue January 30, 2020 at 4:34 PM
Edited

I'm not sure what I did differently when I opened this issue (especially since it was the same error the customer reported), but I ran two experiments: one with the whole Kafka cluster down for 5 minutes and one with it down for 30 minutes. A few minutes after bringing the cluster back online, communication was restored in both cases (for RPC and Sink, and the associated monitored services). I noticed a small burst that might imply the off-heap queue was flushed, but I don't have a way to confirm that.

That said, due to how udpgen works, I can't confirm that all the traps queued off-heap were actually sent; I would have to build a tool that supports custom parameters, or modify udpgen, since the customer needs to see the off-heap feature working in their environment.

The good news is that there seems to be no problem with the Kafka client. Just to be sure, I'm going to rebuild the test from scratch one more time this afternoon, and close this issue as "CANNOT REPRODUCE" if it works again.

Alejandro Galue January 30, 2020 at 3:30 PM

I noticed something else: you don't even have to use the udpgen tool. As soon as you start stopping Kafka brokers, the error appears on the RPC topics, which takes RPC down and makes data collection impossible, so the problem is in the Kafka client and affects both RPC and Sink.

Alejandro Galue January 30, 2020 at 3:19 PM

At least 15 minutes, if I recall correctly, but as the test is dead simple to reproduce, I'm going to start it and leave it running to see if it eventually recovers. I'll do that now and report back after it has been in that failed/unrecoverable state for 2 hours.

Jesse White January 28, 2020 at 3:24 PM

How long did we wait after bringing the Kafka cluster back up?

Cannot Reproduce

Details

Assignee

Reporter

Components

Affects versions

Priority

PagerDuty

Created January 28, 2020 at 3:13 PM
Updated January 30, 2020 at 7:19 PM
Resolved January 30, 2020 at 7:19 PM