XML Collector is not working as expected for node-level resources

Description

XML data collection is mixing up the RRD storage directory. The side effect is that data is stored in the wrong directory, which leads to concurrent write access to the same RRD files.
The exception seen in the log is:

org.opennms.netmgt.collection.api.CollectionException: An undeclared throwable was caught during data collection for interface 27/10.200.19.12/CiscoPorts XML-Collector
	at org.opennms.netmgt.collectd.CollectableService.doCollection(CollectableService.java:421) ~[opennms-services-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService.doRun(CollectableService.java:322) [opennms-services-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService.access$000(CollectableService.java:70) [opennms-services-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService$1.run(CollectableService.java:300) [opennms-services-15.0.1.jar:?]
	at org.opennms.core.logging.Logging.withPrefix(Logging.java:66) [org.opennms.core.logging-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService.run(CollectableService.java:296) [opennms-services-15.0.1.jar:?]
	at org.opennms.netmgt.scheduler.LegacyScheduler$1.run(LegacyScheduler.java:209) [opennms-services-15.0.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_72]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_72]
	at org.opennms.core.concurrent.LogPreservingThreadFactory$3.run(LogPreservingThreadFactory.java:124) [opennms-util-15.0.1.jar:?]
	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_72]
Caused by: java.util.ConcurrentModificationException
	at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922) ~[?:1.7.0_72]
	at java.util.HashMap$KeyIterator.next(HashMap.java:956) ~[?:1.7.0_72]
	at org.opennms.netmgt.collection.api.AttributeGroup.visit(AttributeGroup.java:107) ~[org.opennms.features.collection.api-15.0.1.jar:?]
	at org.opennms.netmgt.collection.support.AbstractCollectionResource.visit(AbstractCollectionResource.java:117) ~[org.opennms.features.collection.api-15.0.1.jar:?]
	at org.opennms.netmgt.collection.support.MultiResourceCollectionSet.visit(MultiResourceCollectionSet.java:69) ~[org.opennms.features.collection.api-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService.doCollection(CollectableService.java:394) ~[opennms-services-15.0.1.jar:?]
	... 10 more
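The "Caused by: java.util.ConcurrentModificationException" inside AttributeGroup.visit() is the classic symptom of a plain HashMap being structurally modified while a fail-fast iterator is still walking it. A minimal, self-contained sketch of that failure mode (this is not OpenNMS code; the class and attribute names are invented for illustration):

import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Sketch of the failure mode behind the "Caused by" above: a plain HashMap is
// structurally modified while its fail-fast iterator is still walking the keys,
// which is what happens when two collector threads share one mutable attribute map.
public class ConcurrentModificationSketch {

    public static void main(String[] args) {
        Map<String, Double> attributes = new HashMap<String, Double>();
        attributes.put("ifInOctets", 1.0);
        attributes.put("ifOutOctets", 2.0);

        try {
            for (String key : attributes.keySet()) {        // fail-fast iterator
                // Stand-in for a second collector thread adding an attribute to
                // the same shared map while this thread is still visiting it.
                attributes.put("ifSpeed", 3.0);
            }
        } catch (ConcurrentModificationException e) {
            System.out.println("Reproduced: " + e);          // same exception as in the trace
        }
    }
}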

According to the log, the exception is associated with [Collectd-Thread-41-of-50], which is collecting one node for the XML service CiscoPorts-XML-Collection:

2015-02-27 11:43:48,876 ERROR [Collectd-Thread-41-of-50] o.o.n.c.CollectableService: An undeclared throwable was caught during data collection for interface 27/10.200.19.12/CiscoPorts XML-Collector

What I can also see in the log is that [Collectd-Thread-27-of-50], which is collecting another node for the same service, has an incorrect node-to-RRD-directory mapping:

2015-02-27 11:43:48,063 INFO [Collectd-Thread-27-of-50] o.o.n.c.CollectableService: run: starting new collection for 12/10.200.39.12/CiscoPorts XML-Collector/CiscoPorts-XML-Collection
2015-02-27 11:43:48,804 DEBUG [Collectd-Thread-27-of-50] o.o.n.c.DefaultCollectionAgent: getStorageDir: isStoreByForeignSource = false, foreignSource = null, foreignId = null, dir = 27
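That mismatch (a collection for node 12 resolving to storage directory 27) is what you would expect if the handler kept per-collection state in a field shared between collector threads. A hypothetical illustration of such a race (this is not the actual XML handler code; the class, field, and method names are invented):

// Hypothetical illustration (not the actual OpenNMS handler code) of how one
// mutable field shared by all collector threads can map node 12 to directory 27.
public class SharedStateRaceSketch {

    // NOT thread safe: a single field written and read by every collector thread.
    private String currentStorageDir;

    public void collect(final int nodeId) throws InterruptedException {
        currentStorageDir = String.valueOf(nodeId);   // this thread writes its own dir...
        Thread.sleep(10);                             // ...but before it reads the field back,
                                                      // another thread may have overwritten it
        System.out.println("node " + nodeId + " -> dir " + currentStorageDir);
    }

    public static void main(String[] args) {
        final SharedStateRaceSketch handler = new SharedStateRaceSketch(); // one shared instance

        Runnable node12 = new Runnable() {
            public void run() {
                try { handler.collect(12); } catch (InterruptedException ignored) { }
            }
        };
        Runnable node27 = new Runnable() {
            public void run() {
                try { handler.collect(27); } catch (InterruptedException ignored) { }
            }
        };

        new Thread(node12).start();   // with unlucky timing this prints "node 12 -> dir 27"
        new Thread(node27).start();
    }
}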

I have attached the full collectd.log taken from OpenNMS startup; I hope it helps pinpoint the issue.

Notes:

  • The exact same configuration was running perfectly well on OpenNMS 1.12.9-2.

  • As far as I can see, only XML data collection is affected (there is no problem with SNMP data collection).

Environment

Red Hat Linux, Oracle Java 1.7.0_72

Acceptance / Success Criteria

None

Attachments

  • 04 Jun 2015, 02:33 PM
  • 03 Jun 2015, 05:30 AM
  • 02 Jun 2015, 05:29 AM
  • 27 Feb 2015, 06:48 AM

Activity

Alejandro Galue April 2, 2016 at 9:04 AM

Yes

Ronny Trommer April 1, 2016 at 8:26 PM

seems to be fixed, can we delete this branch? https://github.com/OpenNMS/opennms/tree/jira/NMS-7516-foundation

Alejandro Galue June 5, 2015 at 3:12 PM

Merged into foundation on revision 297ecdb71402257b687165fecd7654d1a48a7c81.

Merged into develop on revision 297ecdb71402257b687165fecd7654d1a48a7c81.

Cherry-picked on release-16.0.1 on revision 1d9edda5ce94c9e7765f32307ca669d56bfcefe5.

Alejandro Galue June 4, 2015 at 4:25 PM

The problem was related to a variable used inside the XML Collection Handler implementations that was not thread safe. I refactored the code so that all the handlers are 100% stateless, and I added more detail to the DEBUG messages.
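For anyone reading along, the general shape of that kind of fix looks like the sketch below (this is not the actual commit; the class and method names are invented): per-collection values move out of instance fields of the shared handler and into locals/parameters, so each collector thread keeps its own copy on its own stack.

// Before: a field on a handler instance shared by all collector threads.
class StatefulHandlerSketch {
    private String resourceDir;                       // racy: visible to every thread

    void collect(int nodeId) {
        resourceDir = resolveDir(nodeId);             // may be overwritten by another thread
        store(nodeId, resourceDir);                   // may read the other thread's value
    }

    String resolveDir(int nodeId) { return String.valueOf(nodeId); }
    void store(int nodeId, String dir) { /* write JRB/RRD files under dir */ }
}

// After: no per-collection state on the instance; everything is thread-confined.
class StatelessHandlerSketch {
    void collect(int nodeId) {
        String resourceDir = resolveDir(nodeId);      // local variable, one per thread
        store(nodeId, resourceDir);
    }

    String resolveDir(int nodeId) { return String.valueOf(nodeId); }
    void store(int nodeId, String dir) { /* write JRB/RRD files under dir */ }
}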

The code has been committed to a branch off foundation called jira/NMS-7516-foundation. When Bamboo goes green, I'll merge the branch into foundation and then into develop so it is ready for 16.0.1 and 17.0.0.

For more details:

http://bamboo.internal.opennms.com:8085/browse/OPENNMS-NMS335

Alejandro Galue June 3, 2015 at 3:41 PM

I think I was able to reproduce the problem locally.

I'm going to do more tests to be sure, and try to see if I can reproduce the problem on a JUnit test.

The problem seems to be related to the XML Collector after all 😉

With just one node, the XML Collector works just fine (as I've shown you), but when I tried to collect data for the same service from different nodes (like your use case), the collector gets confused when storing the data in the JRB/RRD files.

To better illustrate what I found, I'm running with storeByGroup and storeByForeignSource enabled:

$ grep "updateRRD: updating" collectd.log

2015-06-03 12:43:37,667 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349818:105.0:21.0:210.0:315.0:420.0:525.0'
2015-06-03 12:43:37,707 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349818:105.0:21.0:210.0:315.0:420.0:525.0'
2015-06-03 12:43:37,725 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/opennms-server/sample.rrd with values '1433349818:2575.0:515.0:5150.0:7725.0:10300.0:12875.0'
2015-06-03 12:44:07,955 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349848:110.0:22.0:220.0:330.0:440.0:550.0'
2015-06-03 12:44:09,017 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349849:110.0:22.0:220.0:330.0:440.0:550.0'
2015-06-03 12:44:09,031 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349849:110.0:22.0:5160.0:330.0:10320.0:12900.0'
2015-06-03 12:44:38,960 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349879:115.0:23.0:230.0:345.0:460.0:575.0'
2015-06-03 12:44:40,046 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/opennms-server/sample.rrd with values '1433349880:120.0:24.0:5170.0:360.0:480.0:600.0'
2015-06-03 12:45:09,988 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349910:120.0:24.0:240.0:360.0:480.0:600.0'
2015-06-03 12:45:11,031 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/opennms-server/sample.rrd with values '1433349911:2590.0:518.0:5180.0:7770.0:10360.0:12950.0'
2015-06-03 12:45:11,033 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349911:125.0:25.0:250.0:375.0:500.0:625.0'
2015-06-03 12:45:41,010 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349941:125.0:25.0:250.0:375.0:500.0:625.0'
2015-06-03 12:45:42,056 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349942:130.0:519.0:260.0:390.0:520.0:650.0'
2015-06-03 12:45:42,059 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349942:130.0:519.0:260.0:390.0:10380.0:12975.0'

I'm collecting every 30 seconds, and as you can see, in a seemingly random way, the 3 RRDs that should each be updated every 30 seconds are not being updated properly, which is not correct.

I haven't dug into the code yet, but at least I know that there is something wrong somewhere 😉

Fixed

Details

Created February 27, 2015 at 6:48 AM
Updated April 2, 2016 at 9:04 AM
Resolved June 5, 2015 at 3:12 PM