Hikari CP leaking threads

Description

The resource graph shows that the Hikari connection pool is leaking about one thread per day. If OpenNMS is left running long enough, the count eventually hits the ceiling and stays there. At that point the OpenNMS GUI becomes unresponsive and performance degrades so badly that the system is no longer functional. A full service restart is required to restore service. This has been seen in both Meridian and Horizon and was reported by several users.

Horizon 32.0.6

The sawtooth-like trend corresponds to restarts, each of which brings the thread count back to its baseline.

Meridian 2023.1.9

Here, with a restart in between, it was left running long enough for the active threads to keep climbing until they hit the ceiling.
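To confirm which thread group is actually growing, one option is to sample the live threads of the OpenNMS JVM periodically and compare the counts. The Java sketch below is a hypothetical helper built only on `Thread.getAllStackTraces()`; `ThreadCensus`, `countByPrefix` and `histogram` are illustrative names, not OpenNMS or HikariCP APIs, and the prefix to filter on (for example a pool name such as `HikariPool`) is an assumption that should be checked against a real thread dump. It only sees threads of the JVM it runs in; from outside the process, periodic `jstack <pid>` dumps give the same information.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical diagnostic helper (not part of OpenNMS or HikariCP): takes a
 * census of live threads in the current JVM so that a steadily growing
 * thread group can be spotted by sampling it over time.
 */
final class ThreadCensus {

    /** Counts live threads whose name starts with the given prefix (prefix is an assumption). */
    static int countByPrefix(String prefix) {
        int count = 0;
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Groups live threads by name with any trailing digits stripped, so that
     * "worker-1" and "worker-2" fall into the same "worker-" bucket.
     */
    static Map<String, Integer> histogram() {
        Map<String, Integer> buckets = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive()) {
                String key = t.getName().replaceAll("\\d+$", "");
                buckets.merge(key, 1, Integer::sum);
            }
        }
        return buckets;
    }
}
```

Sampling `ThreadCensus.histogram()` once a day and diffing the results would show which bucket grows by roughly one thread per day.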

The affected systems have one thing in common, although I cannot confirm that it is the root cause: they all have the JDBC collector configured to collect PostgreSQL statistics.

In the Meridian case, once the system was completely degraded, the connection states retrieved from the pg_stat_activity view showed exactly 50 lingering idle connections, all with the same query (shown below). See the attached xlsx file for a full dump. Most of them are old connections, more than a day old. I would expect setting idleTimeout in the datasource config to remove old connections, but it does not. Both instances are configured with the default value of 600 s. This deserves an investigation as well.

select onmsnode0_.nodeSysOID as col_0_0_, count(*) as col_1_0_ from node onmsnode0_ left outer join pathOutage onmsnode0_1_ on onmsnode0_.nodeId=onmsnode0_1_.nodeId where onmsnode0_.nodeSysOID is not null group by onmsnode0_.nodeSysOID
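HikariCP's documented behaviour may be worth checking against the observation above before assuming a pool bug. `idleTimeout` only applies when `minimumIdle` is lower than `maximumPoolSize`, idle connections are not retired once the pool is down to `minimumIdle`, and connections currently borrowed by the application are never idle from the pool's point of view. PostgreSQL reports a backend as `idle` whenever it is waiting for the next client command, so a connection can look idle in pg_stat_activity while still being checked out from the pool. HikariCP also expresses `idleTimeout` in milliseconds (its default is 600000 ms, i.e. 600 s); how OpenNMS maps its datasource settings onto HikariCP is assumed here, not verified. The Java sketch below is a simplified, hypothetical model of that retirement rule, not HikariCP's code; `IdleEvictionModel`, `PooledConn` and `retire` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical, simplified model of HikariCP's documented idle-retirement
 * rule. This is NOT HikariCP or OpenNMS code.
 */
final class IdleEvictionModel {

    /** A pooled connection as seen by the pool (illustrative fields). */
    record PooledConn(String id, boolean inUse, long lastAccessedMs) {}

    /** Returns the ids of connections one housekeeping pass would retire. */
    static List<String> retire(List<PooledConn> conns, int minimumIdle,
                               int maximumPoolSize, long idleTimeoutMs, long nowMs) {
        List<String> retired = new ArrayList<>();
        // idleTimeout is only honoured when minimumIdle < maximumPoolSize.
        if (idleTimeoutMs <= 0 || minimumIdle >= maximumPoolSize) {
            return retired;
        }
        // Only connections sitting unused in the pool are candidates;
        // borrowed connections are never retired, however old they are.
        List<PooledConn> notInUse = conns.stream().filter(c -> !c.inUse()).toList();
        // Keep at least minimumIdle idle connections in the pool.
        int toRemove = notInUse.size() - minimumIdle;
        for (PooledConn c : notInUse) {
            if (toRemove <= 0) {
                break;
            }
            if (nowMs - c.lastAccessedMs() > idleTimeoutMs) {
                retired.add(c.id());
                toRemove--;
            }
        }
        return retired;
    }
}
```

In this model, 50 borrowed connections (`inUse = true`) are never returned by `retire` regardless of their age, and neither is anything when `minimumIdle` equals `maximumPoolSize`; either scenario would be consistent with day-old connections that PostgreSQL still reports as idle.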

Acceptance / Success Criteria

Setting idleTimeout should cause idle connections older than the configured timeout to be removed.

Attachments

08 Feb 2024, 03:32 PM
08 Feb 2024, 02:34 PM
08 Feb 2024, 02:31 PM
08 Feb 2024, 02:31 PM

Activity

Christian Pape March 1, 2024 at 12:44 PM

Merged.

Christian Pape February 28, 2024 at 10:35 AM

Please review:

PR: https://github.com/OpenNMS/opennms/pull/7121

JianYet February 23, 2024 at 4:50 PM

@Christian Pape DM’d you the log files collected when opennms went unresponsive.

Christian Pape February 23, 2024 at 6:52 AM

@JianYet The query you mentioned is part of the audit phase in provisioning. We added async reverse lookups there in NMS-15776, and I want to check whether that change introduced this problem.

JianYet February 22, 2024 at 5:16 PM

@Christian Pape I will get the logs for you. What effect would setting it to false have on HikariCP? I need to consult the customer first and explain it to them.

Fixed

Details
Assignee
Christian Pape
Reporter
JianYet
Labels
bugfixslasupport
HB Grooming Date
Feb 13, 2024
HB Backlog Status
Refined Backlog
Components
Database
Sprint
None
Fix versions
Meridian-2023.1.14
33.0.2
Affects versions
Meridian-2023.1.12
32.0.6
Priority
Major

Created February 8, 2024 at 2:31 PM

Updated May 20, 2024 at 6:34 AM

Resolved March 1, 2024 at 12:44 PM
