Performance issues after upgrading from Horizon 18.0.x to 21.0.2
Description
After upgrading three Horizon 18.0.x setups to Horizon 21.0.2, I ran into performance issues. A few minutes after OpenNMS started, the WebUI was very slow, with response times of around 30 seconds to load a page. At the same time the system load was very high and the Poller detected some "false" outages. The RTC view on the start page also could not be updated fast enough. The graph "OpenNMS Pollerd Threads Active" shows that the full Pollerd thread pool is in use and the "Pollerd Task Queue" value was high.
This behavior persisted for one to two hours; after that, the performance issue disappeared.
I did some further testing. With Provisiond disabled, there was no performance issue. I then re-enabled Provisiond and set the org.opennms.provisiond.scheduleRescanForExistingNodes option in opennms.properties to false. With that setting I also did not run into the performance issue after restarting OpenNMS, but from time to time (a few times a week, with no pattern I can recognize) the performance issue appeared again.
During the performance problem, I see a lot of these error messages in poller.log:
In output.log, I found some of these exceptions. Unfortunately, I cannot tell exactly when they appeared:
Activity

Jesse White April 17, 2019 at 8:06 PM
Looks like a problem similar to the one that was fixed in .

Jeff Gehlbach September 20, 2018 at 4:38 PM
Michael indicates this problem seems rooted in an OID-not-increasing situation caused by a bug in a Fortigate firewall, but it is not unique to those devices. We should look at adding some defensive code to detect this behavior and punt on data collection when it is noted. This should be easy to reproduce.
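For illustration, a minimal sketch of such a check, using hypothetical names rather than OpenNMS API: treat each OID as a sequence of unsigned sub-identifiers and require every GETNEXT/GETBULK response to be strictly greater than the OID it was requested for; if it is not, abort collection for that node instead of walking forever.

class OidOrderCheck {
    // Compares two OIDs sub-identifier by sub-identifier. Values are kept in
    // longs because SNMP sub-identifiers are unsigned 32-bit integers.
    static boolean strictlyIncreasing(long[] previous, long[] next) {
        int common = Math.min(previous.length, next.length);
        for (int i = 0; i < common; i++) {
            if (previous[i] != next[i]) {
                return next[i] > previous[i];   // first differing sub-identifier decides
            }
        }
        return next.length > previous.length;   // identical prefix: the longer OID sorts later
    }
}

// Sketch of how a walker could use it (abortWalk is hypothetical):
//     if (!OidOrderCheck.strictlyIncreasing(requestedOid, responseOid)) {
//         // The agent is looping (e.g. keeps returning ifDescr.1); give up on
//         // this collection instead of buffering results indefinitely.
//         abortWalk("agent returned a non-increasing OID");
//     }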
Michael Batz July 13, 2018 at 8:54 AM
After further debugging, the issue seemed to be caused by a misbehaving SNMP agent on a specific node (a Fortinet firewall).
Please see the output of snmpwalk:
There were multiple values in the IF-MIB with index 1.
After removing this device from our OpenNMS setup, the performance issue seems to be gone. I also see a much smaller number of SnmpResult objects in memory (around 10,000 instead of a few million) and a different picture of the Java heap:
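A rough sketch of why such an agent inflates the result count (a simplified walk loop under assumed semantics, not OpenNMS's actual implementation): a GETNEXT-style table walk only terminates once the returned OID moves past the requested subtree, so an agent that keeps answering with the same instance never lets the walk finish, and every response stays buffered as a result object.

import java.util.ArrayList;
import java.util.List;

class TableWalkSketch {
    // Stand-in for an SNMP GETNEXT request against the agent.
    interface Agent {
        long[] getNext(long[] oid);
    }

    static List<long[]> walk(Agent agent, long[] subtree) {
        List<long[]> results = new ArrayList<>();
        long[] current = subtree;
        while (true) {
            long[] next = agent.getNext(current);
            if (next == null || !startsWith(next, subtree)) {
                return results;        // normal end: the walk has left the subtree
            }
            results.add(next);         // a looping agent keeps us in this branch,
            current = next;            //   so the result list grows without bound
        }
    }

    static boolean startsWith(long[] oid, long[] prefix) {
        if (oid.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (oid[i] != prefix[i]) return false;
        }
        return true;
    }
}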
Michael Batz June 14, 2018 at 2:49 PM
With G1GC the CPU usage was not as high as with ParallelGC, but after around one day I ran into an OutOfMemory exception, so I switched back to ParallelGC. If you look at the Old Gen graph above, the garbage collector seemed to have trouble cleaning some objects until around midnight. So I used the JDK's jmap tool to generate a histogram of the Java object heap:
/opt/java/bin/jmap -histo <PID>
(Jun 14 14:04)
6 minutes later:
(Jun 14 14:10)
(Jun 14 14:30)
As you can see above, the number of org.opennms.netmgt.snmp.SnmpResult instances was very high and kept increasing.
Michael Batz June 10, 2018 at 12:49 PM
I did some further debugging. It seems that garbage collection causes the higher CPU utilization and some timeouts in polling and the WebUI. The "JVM Heap" graph in the attached screenshot shows that it takes garbage collection a few hours to clean objects in Old Gen. At this point I was using the default ParallelGC; I'll give G1GC a try.