Hikari CP leaking threads

Description

The resource graph shows that the Hikari connection pool is leaking roughly one thread per day. If OpenNMS is left running continuously for long enough, the count eventually hits the ceiling and stays there. At that point the OpenNMS GUI becomes unresponsive and performance degrades to the extent that the system is no longer functional. A full service restart is required to restore service. This has been seen in both Meridian and Horizon, and was reported by several users.

Horizon 32.0.6

The sawtooth-like trend corresponds to restarts, each of which resets the thread state.

Meridian 2023.1.9

Here it was left running long enough with a restart. The number of active threads kept climbing until it hit the ceiling.

These instances have one thing in common, although I cannot confirm that it is the root cause: they all have the JDBC collector configured to collect PostgreSQL statistics.

In the Meridian case, once the system was completely degraded, the connection state retrieved from pg_stat_activity showed exactly 50 lingering idle connections, all with the identical query shown below. See the attached xlsx file for a full dump of the table. Most of them are old connections, more than a day old. I would expect setting idleTimeout in the datasource config to remove the old connections, but it does not. Both instances are configured with the default value of 600s. This deserves an investigation as well.

select onmsnode0_.nodeSysOID as col_0_0_, count(*) as col_1_0_ from node onmsnode0_ left outer join pathOutage onmsnode0_1_ on onmsnode0_.nodeId=onmsnode0_1_.nodeId where onmsnode0_.nodeSysOID is not null group by onmsnode0_.nodeSysOID
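
The pg_stat_activity check described above could be automated along the following lines. This is a minimal sketch, not code from OpenNMS: the class `StaleIdleConnections`, the `Session` type, and the method `staleIdleByQuery` are hypothetical names, and it assumes that the `state_change` column is an acceptable proxy for how long a session has been idle. The 600s threshold is the idleTimeout value mentioned in this ticket; the sample rows in `main` are made up.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class StaleIdleConnections {

    // Query that could feed this check; pid, state, state_change and query
    // are standard pg_stat_activity columns.
    static final String SQL =
        "SELECT pid, state, state_change, query FROM pg_stat_activity";

    // One pg_stat_activity row, reduced to the columns used here (hypothetical type).
    static final class Session {
        final int pid;
        final String state;
        final Instant stateChange;
        final String query;

        Session(int pid, String state, Instant stateChange, String query) {
            this.pid = pid;
            this.state = state;
            this.stateChange = stateChange;
            this.query = query;
        }
    }

    // Counts, per query text, the sessions that are in state 'idle' and whose
    // last state change is older than the given idle timeout.
    static Map<String, Long> staleIdleByQuery(List<Session> sessions,
                                              Instant now,
                                              Duration idleTimeout) {
        return sessions.stream()
            .filter(s -> "idle".equals(s.state))
            .filter(s -> Duration.between(s.stateChange, now).compareTo(idleTimeout) > 0)
            .collect(Collectors.groupingBy(s -> s.query, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for rows read via JDBC.
        Instant now = Instant.parse("2024-02-08T14:31:00Z");
        List<Session> sample = List.of(
            new Session(101, "idle", now.minus(Duration.ofDays(2)), "select ... from node ..."),
            new Session(102, "idle", now.minus(Duration.ofDays(1)), "select ... from node ..."),
            new Session(103, "idle", now.minusSeconds(30), "select 1"),
            new Session(104, "active", now.minus(Duration.ofDays(3)), "select ... from node ..."));

        staleIdleByQuery(sample, now, Duration.ofSeconds(600))
            .forEach((query, count) ->
                System.out.println(count + " stale idle session(s): " + query));
    }
}
```

Grouping by query text makes a pattern like the one reported here (many old idle sessions sharing one query) stand out immediately, while short-lived idle sessions that are still within the timeout are ignored.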

Acceptance / Success Criteria

Setting idleTimeout should remove old idle connections accordingly
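
Some context on this criterion: in HikariCP, idleTimeout applies only to connections that sit unused inside the pool. A connection that was borrowed and never returned still counts as in use from the pool's point of view, even if PostgreSQL reports it as idle. Whether that is what happened here is not established in this ticket. The toy model below illustrates the distinction only; it is not HikariCP code, `ToyPool` and its methods are hypothetical names, and it ignores settings such as minimumIdle.

```java
import java.util.ArrayList;
import java.util.List;

class ToyPool {

    // A pooled connection as the toy pool sees it.
    static final class Conn {
        boolean inUse;        // true between borrow() and release()
        long lastReturnedMs;  // time at which it last became idle in the pool
    }

    private final List<Conn> conns = new ArrayList<>();
    private final long idleTimeoutMs;

    ToyPool(long idleTimeoutMs) {
        this.idleTimeoutMs = idleTimeoutMs;
    }

    // Reuses an idle connection if one exists, otherwise creates a new one.
    Conn borrow() {
        for (Conn c : conns) {
            if (!c.inUse) {
                c.inUse = true;
                return c;
            }
        }
        Conn c = new Conn();
        c.inUse = true;
        conns.add(c);
        return c;
    }

    // Returns a connection to the pool and records when it became idle.
    void release(Conn c, long nowMs) {
        c.inUse = false;
        c.lastReturnedMs = nowMs;
    }

    // Evicts only connections that are idle in the pool and past the timeout;
    // borrowed connections are never touched.
    void housekeep(long nowMs) {
        conns.removeIf(c -> !c.inUse && nowMs - c.lastReturnedMs > idleTimeoutMs);
    }

    int size() {
        return conns.size();
    }

    public static void main(String[] args) {
        ToyPool pool = new ToyPool(600_000L);       // 600s, as in this ticket
        Conn returned = pool.borrow();
        Conn leaked = pool.borrow();                // never released
        pool.release(returned, 1_000L);
        pool.housekeep(2L * 24 * 3600 * 1000);      // two days later
        System.out.println("connections left: " + pool.size()); // prints "connections left: 1"
    }
}
```

In this model the returned connection is evicted after the timeout, while the leaked one survives indefinitely, which is one way a database could show day-old idle sessions despite a 600s idleTimeout.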

Attachments

4
  • 08 Feb 2024, 03:32 PM
  • 08 Feb 2024, 02:34 PM
  • 08 Feb 2024, 02:31 PM
  • 08 Feb 2024, 02:31 PM

Activity

Christian Pape March 1, 2024 at 12:44 PM

Merged.

Christian Pape February 28, 2024 at 10:35 AM

JianYet February 23, 2024 at 4:50 PM

DM’d you the log files collected when opennms went unresponsive.

Christian Pape February 23, 2024 at 6:52 AM

The query mentioned is part of the audit phase in provisioning. We added the async reverse lookups there in NMS-15776. I want to check whether that change introduced this problem.

JianYet February 22, 2024 at 5:16 PM

I will get the logs for you. What would be the effect on Hikari CP of setting it to false? I need to consult the customer first and explain it to them.

Fixed

Created February 8, 2024 at 2:31 PM
Updated May 20, 2024 at 6:34 AM
Resolved March 1, 2024 at 12:44 PM
