Provisiond leaks file handles, eventually causing "Too many open files" crashes

Description

The attached patch attempts to solve this issue by keeping only a single NioSocketConnector per timeout value and disposing of connectors as soon as possible (sketched below). I've had it soak-testing for a week on our problematic test server with no crashes, so I believe it is at least making a difference in file handle consumption.
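
Roughly, the idea is something like the following (a minimal sketch of my reading of the patch description, not the actual patch code; ConnectorCache and its method names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.mina.transport.socket.nio.NioSocketConnector;

public final class ConnectorCache {
    // One shared connector per distinct connect-timeout value, instead of
    // one brand-new connector per detection attempt.
    private static final Map<Long, NioSocketConnector> CONNECTORS =
            new HashMap<Long, NioSocketConnector>();

    private ConnectorCache() {}

    // Return the shared connector for this timeout, creating it on first use.
    public static synchronized NioSocketConnector getConnector(long timeoutMillis) {
        NioSocketConnector connector = CONNECTORS.get(timeoutMillis);
        if (connector == null) {
            connector = new NioSocketConnector();
            connector.setConnectTimeoutMillis(timeoutMillis);
            CONNECTORS.put(timeoutMillis, connector);
        }
        return connector;
    }

    // Dispose every cached connector, releasing its selector and pipe descriptors.
    public static synchronized void disposeAll() {
        for (NioSocketConnector connector : CONNECTORS.values()) {
            connector.dispose();
        }
        CONNECTORS.clear();
    }
}
```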

Original issue as copy/pasted from opennms-devel post:
I've been doing a lot of digging around various 'Too many open files' crashes we've been seeing locally, and I think I've pinned down a big leak of file descriptors in provisiond's use of org.apache.mina connectors.

What it's currently doing in AsyncBasicDetector#isServiceDetected (a sketch follows the list):

  • For each service, create a new NioSocketConnector

  • Configure that connector with a handler, filters, etc.

  • Make an outbound connection and check for results

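A minimal sketch of that per-attempt pattern (the class name, method signature, and dispose() placement are illustrative, not the actual AsyncBasicDetector code):

```java
import java.net.InetSocketAddress;

import org.apache.mina.core.future.ConnectFuture;
import org.apache.mina.core.service.IoHandler;
import org.apache.mina.transport.socket.nio.NioSocketConnector;

public class PerAttemptDetector {
    public boolean isServiceDetected(InetSocketAddress address, IoHandler handler,
            long timeoutMillis) {
        NioSocketConnector connector = new NioSocketConnector(); // opens selector/pipe FDs
        connector.setConnectTimeoutMillis(timeoutMillis);
        connector.setHandler(handler);                      // per-attempt handler, filters, etc.
        ConnectFuture future = connector.connect(address);  // one more FD for the socket
        future.awaitUninterruptibly();
        boolean detected = future.isConnected();            // check for results
        connector.dispose();                                // without an aggressive dispose(),
                                                            // the constructor's FDs pile up
        return detected;
    }
}
```
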
There seem to be two problems with this approach:

1) Constructing an NioSocketConnector creates a lot of 'anon_inode' and 'pipe' file descriptors - on one machine it was 8 and 12 respectively, and on another 4 and 8, so I'm not sure quite what the difference is there (under Linux, at least; I assume there's some equivalent under Windows). The actual connect() call only uses one more handle. This causes it to run out of descriptors a lot faster than expected.
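
My guess at the per-machine difference is that the connector's default I/O processor pool is sized from the CPU count, so the descriptor cost scales with cores (an assumption; I haven't confirmed it in the Mina source). A quick Linux-only probe like the one below (a hypothetical helper, not part of the patch) makes the construction cost visible:

```java
import java.io.File;

import org.apache.mina.transport.socket.nio.NioSocketConnector;

public class FdProbe {
    // Count open descriptors for this process by listing /proc/self/fd
    // (Linux-specific; returns -1 if the directory can't be read).
    private static int openFds() {
        String[] entries = new File("/proc/self/fd").list();
        return entries == null ? -1 : entries.length;
    }

    public static void main(String[] args) {
        int before = openFds();
        NioSocketConnector connector = new NioSocketConnector();
        int during = openFds();
        connector.dispose();
        System.out.println("construction used " + (during - before) + " descriptors");
        System.out.println("after dispose(): " + (openFds() - before) + " extra remain");
    }
}
```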

2) If new NioSocketConnector() fails with a "Too many open files" exception, Mina sometimes just falls over dead with "NoClassDefFoundError: Could not initialize class sun.nio.ch.FileDispatcher". This class does exist in my JVM (OpenJDK 6), and if I reflectively inspect it first, it sometimes stops the crashes from happening; I'm pretty baffled there, to be honest. If Mina does get itself into this state, you can't close existing sockets and you can't open new ones; all the anon_inode and pipe FDs just sit there. This tallies with behaviour we've witnessed in OpenNMS instances after a "Too many open files" crash - lsof shows a few thousand pipe/anon_inode handles just sitting around long after the crash.
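
The reflective workaround amounts to something like this (a hedged sketch; all I know is that touching the class early sometimes stops the crashes):

```java
public class FileDispatcherPreload {
    static {
        try {
            // Force sun.nio.ch.FileDispatcher to initialize while descriptors
            // are still plentiful, so its static initialization can't fail
            // later under "Too many open files" and leave behind the one-shot
            // NoClassDefFoundError that wedges Mina.
            Class.forName("sun.nio.ch.FileDispatcher");
        } catch (Throwable t) {
            // JDK-internal class; it may be absent on some JVMs. Ignore.
        }
    }
}
```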

What I think Mina wants you to do is create a single NioSocketConnector, reuse it everywhere, and use the optional IoSessionInitializer argument to .connect() to configure filters and attach state objects to the IoSession. This would take a moderate overhaul of AsyncBasicDetector: the handler would need to be rewritten as a singleton that carries per-attempt state via IoSession.setAttribute/getAttribute rather than having one handler per service-detection attempt, and a fair chunk of refactoring would probably come along with that (see the sketch below).
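
That refactoring might end up looking roughly like this (a minimal sketch of the pattern; SharedConnectorExample, SingletonDetectorHandler, and the "detectionState" attribute key are all hypothetical names, not OpenNMS code):

```java
import java.net.InetSocketAddress;

import org.apache.mina.core.future.ConnectFuture;
import org.apache.mina.core.service.IoHandlerAdapter;
import org.apache.mina.core.session.IoSession;
import org.apache.mina.core.session.IoSessionInitializer;
import org.apache.mina.transport.socket.nio.NioSocketConnector;

public final class SharedConnectorExample {
    // One connector, and therefore one set of selector/pipe descriptors,
    // shared by every detection attempt.
    private static final NioSocketConnector CONNECTOR = new NioSocketConnector();
    static {
        CONNECTOR.setHandler(new SingletonDetectorHandler());
    }

    public static ConnectFuture detect(InetSocketAddress address, final Object state) {
        // Per-attempt state rides on the IoSession instead of on the handler.
        return CONNECTOR.connect(address, new IoSessionInitializer<ConnectFuture>() {
            public void initializeSession(IoSession session, ConnectFuture future) {
                session.setAttribute("detectionState", state);
            }
        });
    }

    // Singleton handler: reads per-session state back out of the attributes.
    static class SingletonDetectorHandler extends IoHandlerAdapter {
        @Override
        public void sessionOpened(IoSession session) {
            Object state = session.getAttribute("detectionState");
            // ... drive the detection conversation using 'state' ...
        }
    }
}
```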

Acceptance / Success Criteria

None

Attachments: 2

Activity

Seth Leger May 15, 2012 at 5:34 PM

I've simplified the ConnectionFactory code inside the MINA detector classes so that it reuses Executor pools and disposes of connectors after they are used. This makes everything faster, because it no longer starts up new threads for every detector, and it recycles file handles properly because it disposes of the NioSocketConnector instances aggressively (sketched below, after the commit references). In unit tests, the problem appears to be fixed. Marking as fixed.

commit d8cb7a1dc9e9e794770fd03bd1763a5d2f13dba8
commit 6f44d3702bb083e974a67d3116dde2882cedf670
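
The shape of the change is roughly the following (a simplified sketch, not the actual ConnectionFactory code from the commits above; PooledConnectionFactory and detect() are illustrative names):

```java
import java.net.InetSocketAddress;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.mina.core.future.ConnectFuture;
import org.apache.mina.core.service.IoHandler;
import org.apache.mina.transport.socket.nio.NioProcessor;
import org.apache.mina.transport.socket.nio.NioSocketConnector;

public final class PooledConnectionFactory {
    // Shared across all detections: connectors come and go, but the thread
    // pool and the I/O processor stay alive.
    private static final ExecutorService EXECUTOR = Executors.newCachedThreadPool();
    private static final NioProcessor PROCESSOR = new NioProcessor(EXECUTOR);

    public static boolean detect(InetSocketAddress address, IoHandler handler,
            long timeoutMillis) {
        // Build the connector on the shared pools; dispose() will not tear
        // down an executor or processor it did not create itself.
        NioSocketConnector connector = new NioSocketConnector(EXECUTOR, PROCESSOR);
        try {
            connector.setConnectTimeoutMillis(timeoutMillis);
            connector.setHandler(handler);
            ConnectFuture future = connector.connect(address);
            future.awaitUninterruptibly();
            return future.isConnected();
        } finally {
            connector.dispose(); // aggressive disposal recycles the socket FD
        }
    }
}
```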

Seth Leger May 14, 2012 at 11:24 AM

I'm currently working on improving the NioSocketConnector handling in 1.10, reassigning to myself.

Seth Leger April 18, 2012 at 10:41 AM

Changing fix version from 1.8.13 to 1.10.2 since it looks like the changes that went into 1.8 were not sufficient to completely fix this issue.

Seth Leger April 17, 2012 at 9:49 AM

This issue is still causing problems based on reports from the opennms-devel mailing list. I am also able to reproduce them on my Linux VM inside Eclipse by running the AsyncDetectorFileDescriptorLeakTest. This test has had intermittent failures on Bamboo in the past, but it doesn't seem to fail there at the moment.

I worked on making the test code more reliable and remember thinking that we're probably running into a hard-to-trigger bug inside Apache Mina. But I don't have enough information to say that for sure yet. It could also be that some of our async operations are just not playing nice with the Mina code.

Anyway, reopening this issue so that we can give it more attention.

Benjamin Reed February 10, 2012 at 2:15 PM

FYI, this was backported to pre-1.8.13 in 336dbc4db7cbf9dfcaee6f09c3947e7fdde746fa

Fixed

Details

Created July 25, 2011 at 11:01 AM
Updated August 9, 2017 at 2:51 PM
Resolved May 15, 2012 at 5:34 PM