Spotted thread leak in Syslogd localhost name lookups

Description

We have seen a very large number of threads from OpenNMS: 32318 of them. After stopping OpenNMS, the thread count returned to normal. The problem showed up when simple bash commands exited with the error message

bash: fork: Cannot allocate memory

and yum update failed with the error

thread.error: can't start new thread
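
For anyone wanting to confirm a similar thread count on a Linux host, the following is a minimal sketch, assuming the `Threads:` field of `/proc/<pid>/status` is available. The function names are illustrative and are not part of OpenNMS.

```python
import re


def threads_from_status(status_text):
    """Return the 'Threads:' value from /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^Threads:\s+(\d+)\s*$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def thread_count(pid):
    """Read the kernel's thread count for one process (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return threads_from_status(handle.read())
```

Calling `thread_count(pid)` with the PID of the OpenNMS JVM at intervals should show whether the number keeps climbing, as it would during a leak.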

Environment

OpenNMS 1.11.3
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.8) (rhel-1.56.1.11.8.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
CentOS 6 with 2.6.32-279.22.1.el6.x86_64

Attachments

4

Activity

Seth Leger August 5, 2013 at 10:59 AM

Since this sounds like a JVM issue that we cannot code around, I'm going to resolve this. If you encounter problems with host name lookup threads as a user, please try upgrading to Java 7 to fix the issue. Java 6 is near (or maybe past) end-of-life support at this point anyway.

jcat August 2, 2013 at 11:23 AM

Well, since the Java update over a month ago, we've not seen this at all.
I even managed to go a good 10 days without a restart for any reason, and still no issues.

Great news, and from my point of view, I can call it resolved.

Thanks for the assistance from all.

Cheers,
Just

jcat June 24, 2013 at 11:49 AM

Small update.

Since I last posted, we've seen this 3 times.
Today I deployed Oracle Java 1.7 (1.7.0_21) in production and, as previously indicated, we'll just have to see how it goes from now.

Cheers,
Just

jcat June 7, 2013 at 6:01 AM

So I can confirm there's no bad /etc/hosts entry on the server, so I guess that points to something in JVM land.

I've been trying to reproduce this in our test environment, and so far no luck. (Same OS, JVM, config, etc.)
I've fired 50,032,578 syslog messages (and counting) so far in a 24-hour period. No dice.
On the test server, I've also tried invalidating the localhost entry in /etc/hosts, still no luck!
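
A check of this kind can be scripted. The sketch below is only an illustration, assuming the usual hosts-file layout of an address followed by one or more names, with `#` starting a comment. `addresses_for` is a made-up helper name.

```python
def addresses_for(hosts_text, name):
    """Return the addresses that hosts-file text maps to the given host name."""
    found = []
    wanted = name.lower()
    for line in hosts_text.splitlines():
        # Drop any trailing comment, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and wanted in (alias.lower() for alias in fields[1:]):
            found.append(fields[0])
    return found
```

Passing the contents of `/etc/hosts` together with `socket.gethostname()` would show which addresses, if any, the file provides for the local host name. An empty list means the lookup has to be answered elsewhere, for example by DNS.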

So if I can't reproduce it, it really only leaves me with one course of action: upgrade the JVM in production and see how it goes.
I'll need to run the new JVM in test for a while first (maybe a week or so) before upgrading prod.

I've created some Oracle Java 1.7 packages from the latest JDK on the Oracle website, as I do so hate doing things outside the package manager.

So if all goes well in test, I'll have it running in prod in a week. At that point we'll just have to see how it goes.
The thread leak occurred three times in the space of a month previously, so if it survives a month without an incident I'd say it was a likely fix.

I'll keep you all informed.
Thanks for your input so far.

Cheers,
Just

Jeff Gehlbach June 6, 2013 at 5:01 PM

Thanks for the analysis, Ben.

Among systems where we've seen many threads stuck in that getLocalHostName() method, running with a Java 6 JVM has been a common thread. Therefore upgrading those systems to use Oracle Java 7 has been part of the solution. Given this finding, I suggest very strongly trying this. It may mean that you have to install the JDK outside the package system, which sucks, but if it's at all possible to try it please do.
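
To check whether a given system shows the same pattern, one rough approach is to take a thread dump with `jstack` and count the threads whose stack mentions `getLocalHostName`. The sketch below assumes the usual text layout of a dump, in which each thread starts with a line beginning with a double quote and threads are separated by blank lines. The function name is hypothetical.

```python
def count_threads_in_frame(dump_text, frame="getLocalHostName"):
    """Return (matching, total): threads whose stack mentions `frame`, and all threads."""
    matching = 0
    total = 0
    for block in dump_text.split("\n\n"):
        lines = block.strip().splitlines()
        # Thread entries start with the quoted thread name; skip headers and other sections.
        if not lines or not lines[0].startswith('"'):
            continue
        total += 1
        if any(frame in line for line in lines[1:]):
            matching += 1
    return matching, total
```

A large first number relative to the second would be consistent with the leak described in this issue, though it does not by itself identify the cause.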

Created April 15, 2013 at 10:55 AM
Updated August 5, 2013 at 10:59 AM
Resolved August 5, 2013 at 10:59 AM