Kafka Producer deadlock

Duplicate

Description

In a production deployment we have noticed that the Kafka Producer gets stuck in an apparent deadlock, which causes most event processing to halt.

The following stack trace was observed while the producer was in this state:

Acceptance / Success Criteria

None

Details

Created August 10, 2021 at 12:49 PM
Updated August 16, 2021 at 6:48 PM
Resolved August 16, 2021 at 6:48 PM

Activity

Jesse White - August 16, 2021 at 6:48 PM

Duplicate of

Jesse White - August 16, 2021 at 6:47 PM

After some further investigation, it looks like loading some of the HwEntity trees can take a long time (8+ minutes). This could explain the observed behavior.

Jesse White - August 10, 2021 at 12:57 PM

As a workaround, the nodeTopic can be set to an empty string so that no node data is forwarded and this mapping does not occur.

Jesse White - August 10, 2021 at 12:56 PM

Here is the line in the Kafka Producer code where it appears to be stuck:
https://github.com/OpenNMS/opennms/blob/opennms-28.0.1-1/features/kafka/producer/src/main/java/org/opennms/features/kafka/producer/ProtobufMapper.java#L148

There may be an issue with loading this specific HwEntity tree, or perhaps a deadlock with the database.

In these calls, Alarmd already has a read/write DB transaction open, and the NodeCache opens another, nested read-only transaction here: https://github.com/OpenNMS/opennms/blob/opennms-28.0.1-1/features/kafka/producer/src/main/java/org/opennms/features/kafka/producer/NodeCache.java#L77

Updating the NodeCache code to open the transaction conditionally (skipping it if one is already open) may help resolve the problem.
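
A minimal sketch of the "open a transaction only if none is active" idea follows. It is not the NodeCache code: the class `ConditionalTransaction`, its `ThreadLocal` flag, and the methods `withTransaction` and `withTransactionIfNoneActive` are all hypothetical stand-ins so that the example runs on its own. In Spring-based code the check could probably be done with `TransactionSynchronizationManager.isActualTransactionActive()`, but whether that fits the NodeCache as written is an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ConditionalTransaction {
    // Hypothetical stand-in for the real transaction manager: records whether
    // the current thread already has a transaction open.
    private static final ThreadLocal<Boolean> ACTIVE =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Counts how many transactions were actually opened (illustration only).
    public static final AtomicInteger OPENED = new AtomicInteger();

    public static boolean isTransactionActive() {
        return ACTIVE.get();
    }

    // Always opens a (simulated) transaction around the given work.
    public static <T> T withTransaction(Supplier<T> work) {
        boolean previous = ACTIVE.get();
        OPENED.incrementAndGet();
        ACTIVE.set(true);
        try {
            return work.get();
        } finally {
            // Restore the previous state so nesting behaves correctly.
            ACTIVE.set(previous);
        }
    }

    // Reuses the caller's transaction when one is open; otherwise opens one.
    public static <T> T withTransactionIfNoneActive(Supplier<T> work) {
        if (isTransactionActive()) {
            return work.get();
        }
        return withTransaction(work);
    }

    public static void main(String[] args) {
        // Caller (think Alarmd) already holds a transaction: no second one is opened.
        String nested = withTransaction(
                () -> withTransactionIfNoneActive(() -> "node-1"));
        System.out.println(nested + ", transactions opened: " + OPENED.get());

        // No transaction is open: the helper opens one itself.
        String standalone = withTransactionIfNoneActive(() -> "node-2");
        System.out.println(standalone + ", transactions opened: " + OPENED.get());
    }
}
```

With this shape, a lookup made from inside an existing read/write transaction runs in that transaction instead of nesting a new read-only one, while lookups made outside any transaction still get their own.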