javax jmdns task cripples server? or collection problems

Description

I have been tracking an issue on a test server. It appears to grow worse as nodes are added. Currently the server runs fine for about 3-5 hours before becoming sluggish and ceasing to collect data.

I have seen occasional deadlock log entries but suspect they are side effects of the problem. The log entries that consistently coincide with the last timestamps of jrb updates are the javax.jmdns messages in output.log (attached; "servername" replaces the real server name). At the same time I also see a consistent trend in the JVM statistic "OpenNMS Queued Operations Pending" (graph attached). At about 34M pending operations, OpenNMS consistently goes into this sluggish state and collected data is no longer stored. collectd.log shows thresholds being evaluated and entries for jrb storage, but the jrbs are not updated.
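
For reference, here is a minimal sketch of polling a pending-operations counter like this over JMX, for example to graph or alert on it outside of OpenNMS. The port, MBean name ("OpenNMS:Name=Queued"), and attribute name ("EnqueuedOperationsPending") below are assumptions for illustration only; verify the real names against your install with a JMX browser such as jconsole.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class QueuedOpsProbe {
        public static void main(String[] args) throws Exception {
            // Hypothetical JMX endpoint; use whatever remote JMX port your
            // opennms.conf actually exposes.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:18980/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                // Hypothetical MBean/attribute names for the Queued daemon stats;
                // confirm the actual names with jconsole before relying on them.
                ObjectName queued = new ObjectName("OpenNMS:Name=Queued");
                Number pending = (Number) mbsc.getAttribute(queued, "EnqueuedOperationsPending");
                System.out.println("Pending queued operations: " + pending);
            } finally {
                connector.close();
            }
        }
    }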

Environment

RHEL 5.5 x86_64; OpenNMS 1.10.4; Sun JDK 1.6_30; 2429 nodes, 7827 interfaces, 13891 services, ~600,000 jrbs collected; PostgreSQL 9.1 on a separate server

Acceptance / Success Criteria

None

Attachments

3

Activity

Seth Leger November 5, 2012 at 4:19 PM

Hi Ken,

I'm going to close this bug as cannot reproduce for now. Here are some suggestions that I have:

  • Upgrade to 1.10.6; several file handle exhaustion bugs were fixed in that version.

  • Do a thread dump when the OpenNMS process is hung and see if many of the threads are hung on InetAddress.getLocalHost() calls. If so:

    • Check whether a "localhost" DNS lookup returns both IPv4 (127.0.0.1) and IPv6 (::1) addresses (a quick check is sketched after this list). If not, add aliases to your /etc/hosts file.

    • If you send a lot of notifications, go into /opt/opennms/etc/javamail.properties and change the value of the property org.opennms.core.utils.mailHost from the default of "127.0.0.1" to "localhost"
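
A quick way to verify the "localhost" resolution point above is the following minimal Java check. It is only a sketch: it reports what the resolver returns for "localhost", and the /etc/hosts alias advice comes from the bullet above.

    import java.net.Inet4Address;
    import java.net.Inet6Address;
    import java.net.InetAddress;

    public class LocalhostCheck {
        public static void main(String[] args) throws Exception {
            boolean hasV4 = false;
            boolean hasV6 = false;
            // List every address the resolver returns for "localhost".
            for (InetAddress addr : InetAddress.getAllByName("localhost")) {
                System.out.println("localhost -> " + addr.getHostAddress());
                if (addr instanceof Inet4Address) hasV4 = true;
                if (addr instanceof Inet6Address) hasV6 = true;
            }
            if (!hasV4) System.out.println("No IPv4 (127.0.0.1) mapping for localhost; consider an /etc/hosts alias.");
            if (!hasV6) System.out.println("No IPv6 (::1) mapping for localhost; consider an /etc/hosts alias.");
        }
    }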

Ken Eshelby August 13, 2012 at 12:55 PM

queued activity graph definition

Ken Eshelby August 13, 2012 at 12:53 PM

I am not sure how to classify this one: is Queued being overloaded by the collection load? Jeff mentioned another client doing more collections without an issue. We are up to 800-900k jrb files, and the disk I/O stats don't look much different than they did at 300k jrbs.

So, I'm not able to say whether this is my issue with resources or whether it is an OpenNMS system issue.

However, I built a JVM Queued graph definition from already-collected data, though I don't have it in patch format. Together with the pending operations it really helps show my box going into its death spiral. Attaching the definition.

Ken Eshelby July 26, 2012 at 12:40 PM

It looks like I've run into a collection limitation somewhere. I've scaled back collection (no more "collect anything with a word in ifAlias") and the server ran through the night, and the Queued data looks reasonable. What concerns me is that I never see excessive I/O wait or IOPS compared with our other OpenNMS servers writing to the same SAN, so it might be something within OpenNMS.

Going to re-enable the jmdns strategy to keep variables consistent.

Ken Eshelby July 25, 2012 at 8:49 PM

I am tending to agree. There is no change in behavior after setting the NullStrategy.

So this can probably be closed. I am going to see what pulling back data collection does. The queued log is pretty busy with errors, and I see the same growth of pending queued operations, with about 2k/s operations queued and far fewer updates completed. But hey, it made me build a new graph for all that info, since it was already being collected.

Cannot Reproduce

Details

Created July 24, 2012 at 10:00 PM
Updated November 5, 2012 at 4:19 PM
Resolved November 5, 2012 at 4:19 PM