DB installer tries to class-load every file under OPENNMS_HOME as a JAR

Description

To reproduce this issue without a ton of actual RRD files, just make a bazillion ~40KB garbage files beneath share/rrd/snmp. Half a million files ought to be a good start. I'm not exaggerating – this kind of scale exists in the wild.

Run the installer. You'll see it get through most of its work, then apparently hang after printing:

- checking if database "opennms" is unicode... ALREADY UNICODE - Checking for old import files in /opt/opennms/etc... DONE

(The last messages printed may vary from one release to the next)

You'll see the logs go silent except for gc.log (if you've got that one configured) and maybe output.log. If you run lsof during this period, you'll see that the JVM is opening each of those files. I've seen this go on for hours, which is a big deal on production systems.

This behavior is apparently down to the way Spring does class loading, and it's scary difficult to fix. Fortunately there is a workaround: temporarily remove the symbolic link named share from OPENNMS_HOME, and restore it after the installer completes but before starting OpenNMS again. Assuming you installed from packages, this trick is available. Just be sure to note the path that the symlink pointed to before removing it (it may not point where you think it does, especially on RPM-based installs where the path is relocatable).

Environment

Any system with RRDTool or JRobin persisting and many (many, many) RRD files in {{OPENNMS_HOME/share/rrd}}, especially on rotational disks

Acceptance / Success Criteria

None

Lucidchart Diagrams

Activity

Jesse White February 27, 2017 at 4:53 PM

Jesse White February 24, 2017 at 3:06 PM
Edited

I was able to reproduce this by create a number of small files in share/rrd/snmp using:

for x in `seq 1 500000`; do fallocate -l 40K file$x.dat done

And by running strace on the installer:

sudo strace -f java -XX:+UseG1GC -Xmx2g -Dopennms.bootstrap.debug=true -Dopennms.home=/home/jesse/git/opennms/target/opennms-19.0.0-SNAPSHOT -Dlog4j.configurationFile=/home/jesse/git/opennms/target/opennms-19.0.0-SNAPSHOT/etc/log4j2-tools.xml -cp /home/jesse/git/opennms/target/opennms-19.0.0-SNAPSHOT/lib/opennms_bootstrap.jar org.opennms.bootstrap.InstallerBootstrap -dis

I was able to confirm the stat calls:

[pid 28128] stat("/home/jesse/git/opennms/target/opennms-19.0.0-SNAPSHOT/share/rrd/snmp/file6861.dat", {st_mode=S_IFREG|0664, st_size=40960, ...}) = 0 [pid 28128] stat("/home/jesse/git/opennms/target/opennms-19.0.0-SNAPSHOT/share/rrd/snmp/file50905.dat", {st_mode=S_IFREG|0664, st_size=40960, ...}) = 0 [pid 28128] stat("/home/jesse/git/opennms/target/opennms-19.0.0-SNAPSHOT/share/rrd/snmp/file79827.dat", {st_mode=S_IFREG|0664, st_size=40960, ...}) = 0

I was also able to grab a stack-trace of the JVM during that same run.

Converting the 28128 to HEX gives us 0x6de0, which can be used to find the associated thread in the stack-trace (searching for nid=0x6de0):

"Main" #11 prio=5 os_prio=0 tid=0x00007f57354a0800 nid=0x6de0 runnable [0x00007f56d03c1000] java.lang.Thread.State: RUNNABLE at java.io.UnixFileSystem.getBooleanAttributes0(Native Method) at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242) at java.io.File.isDirectory(File.java:849) at org.opennms.bootstrap.Bootstrap$1.accept(Bootstrap.java:80) at java.io.File.listFiles(File.java:1291) at org.opennms.bootstrap.Bootstrap.loadClasses(Bootstrap.java:194) at org.opennms.bootstrap.Bootstrap.loadClasses(Bootstrap.java:198) at org.opennms.bootstrap.Bootstrap.loadClasses(Bootstrap.java:198) at org.opennms.bootstrap.Bootstrap.loadClasses(Bootstrap.java:198) at org.opennms.bootstrap.Bootstrap.loadClasses(Bootstrap.java:158) at org.opennms.install.Installer.install(Installer.java:264) at org.opennms.install.Installer.main(Installer.java:989) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.opennms.bootstrap.Bootstrap$4.run(Bootstrap.java:464) at java.lang.Thread.run(Thread.java:745)

So, our own Bootstrap code is the culprit.

Fixed

Details

Assignee

Reporter

Labels

Components

Sprint

Priority

PagerDuty

Created October 31, 2016 at 11:24 AM
Updated March 1, 2017 at 9:47 AM
Resolved March 1, 2017 at 9:47 AM

Flag notifications