XML Collector is not working as expected for node-level resources

Description

XML data collection is mixing up the RRD storage directory. The side effect is that data is stored in the wrong directory, which leads to concurrent write access to the same RRD files.
The exception seen in the log is:

org.opennms.netmgt.collection.api.CollectionException: An undeclared throwable was caught during data collection for interface 27/10.200.19.12/CiscoPorts XML-Collector
	at org.opennms.netmgt.collectd.CollectableService.doCollection(CollectableService.java:421) ~[opennms-services-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService.doRun(CollectableService.java:322) [opennms-services-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService.access$000(CollectableService.java:70) [opennms-services-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService$1.run(CollectableService.java:300) [opennms-services-15.0.1.jar:?]
	at org.opennms.core.logging.Logging.withPrefix(Logging.java:66) [org.opennms.core.logging-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService.run(CollectableService.java:296) [opennms-services-15.0.1.jar:?]
	at org.opennms.netmgt.scheduler.LegacyScheduler$1.run(LegacyScheduler.java:209) [opennms-services-15.0.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_72]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_72]
	at org.opennms.core.concurrent.LogPreservingThreadFactory$3.run(LogPreservingThreadFactory.java:124) [opennms-util-15.0.1.jar:?]
	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_72]
Caused by: java.util.ConcurrentModificationException
	at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922) ~[?:1.7.0_72]
	at java.util.HashMap$KeyIterator.next(HashMap.java:956) ~[?:1.7.0_72]
	at org.opennms.netmgt.collection.api.AttributeGroup.visit(AttributeGroup.java:107) ~[org.opennms.features.collection.api-15.0.1.jar:?]
	at org.opennms.netmgt.collection.support.AbstractCollectionResource.visit(AbstractCollectionResource.java:117) ~[org.opennms.features.collection.api-15.0.1.jar:?]
	at org.opennms.netmgt.collection.support.MultiResourceCollectionSet.visit(MultiResourceCollectionSet.java:69) ~[org.opennms.features.collection.api-15.0.1.jar:?]
	at org.opennms.netmgt.collectd.CollectableService.doCollection(CollectableService.java:394) ~[opennms-services-15.0.1.jar:?]
	... 10 more
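The "Caused by: java.util.ConcurrentModificationException" inside AttributeGroup.visit() is the classic symptom of a plain HashMap being structurally modified while a fail-fast iterator is still walking it. A minimal, self-contained sketch of that failure mode (this is not OpenNMS code; the class and attribute names are invented for illustration):

import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Sketch of the failure mode behind the "Caused by" above: a plain HashMap is
// structurally modified while its fail-fast iterator is still walking the keys,
// which is what happens when two collector threads share one mutable attribute map.
public class ConcurrentModificationSketch {

    public static void main(String[] args) {
        Map<String, Double> attributes = new HashMap<String, Double>();
        attributes.put("ifInOctets", 1.0);
        attributes.put("ifOutOctets", 2.0);

        try {
            for (String key : attributes.keySet()) {        // fail-fast iterator
                // Stand-in for a second collector thread adding an attribute to
                // the same shared map while this thread is still visiting it.
                attributes.put("ifSpeed", 3.0);
            }
        } catch (ConcurrentModificationException e) {
            System.out.println("Reproduced: " + e);          // same exception as in the trace
        }
    }
}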

According to the log, the exception is associated with [Collectd-Thread-41-of-50], which is collecting one node for the XML service CiscoPorts-XML-Collection:

2015-02-27 11:43:48,876 ERROR [Collectd-Thread-41-of-50] o.o.n.c.CollectableService: An undeclared throwable was caught during data collection for interface 27/10.200.19.12/CiscoPorts XML-Collector

What I can also see in the log is that [Collectd-Thread-27-of-50], which is collecting another node for the same service, has an incorrect node-to-RRD-directory mapping:

2015-02-27 11:43:48,063 INFO [Collectd-Thread-27-of-50] o.o.n.c.CollectableService: run: starting new collection for 12/10.200.39.12/CiscoPorts XML-Collector/CiscoPorts-XML-Collection
2015-02-27 11:43:48,804 DEBUG [Collectd-Thread-27-of-50] o.o.n.c.DefaultCollectionAgent: getStorageDir: isStoreByForeignSource = false, foreignSource = null, foreignId = null, dir = 27
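That mismatch (a collection for node 12 resolving to storage directory 27) is what you would expect if the handler kept per-collection state in a field shared between collector threads. A hypothetical illustration of such a race (this is not the actual XML handler code; the class, field, and method names are invented):

// Hypothetical illustration (not the actual OpenNMS handler code) of how one
// mutable field shared by all collector threads can map node 12 to directory 27.
public class SharedStateRaceSketch {

    // NOT thread safe: a single field written and read by every collector thread.
    private String currentStorageDir;

    public void collect(final int nodeId) throws InterruptedException {
        currentStorageDir = String.valueOf(nodeId);   // this thread writes its own dir...
        Thread.sleep(10);                             // ...but before it reads the field back,
                                                      // another thread may have overwritten it
        System.out.println("node " + nodeId + " -> dir " + currentStorageDir);
    }

    public static void main(String[] args) {
        final SharedStateRaceSketch handler = new SharedStateRaceSketch(); // one shared instance

        Runnable node12 = new Runnable() {
            public void run() {
                try { handler.collect(12); } catch (InterruptedException ignored) { }
            }
        };
        Runnable node27 = new Runnable() {
            public void run() {
                try { handler.collect(27); } catch (InterruptedException ignored) { }
            }
        };

        new Thread(node12).start();   // with unlucky timing this prints "node 12 -> dir 27"
        new Thread(node27).start();
    }
}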

I have attached the full collectd.log taken from OpenNMS startup; I hope it helps pinpoint the issue.

Notes:

  • The exact same configuration was running perfectly well on OpenNMS 1.12.9-2.

  • As far as I can see, only XML data collection is affected (there is no problem with SNMP data collection).

Environment

Red Hat Linux, Oracle Java 1.7.0_72

Acceptance / Success Criteria

None

Attachments

  • 04 Jun 2015, 02:33 PM
  • 03 Jun 2015, 05:30 AM
  • 02 Jun 2015, 05:29 AM
  • 27 Feb 2015, 06:48 AM

Activity

Alejandro Galue April 2, 2016 at 9:04 AM

Yes

Ronny Trommer April 1, 2016 at 8:26 PM

seems to be fixed, can we delete this branch? https://github.com/OpenNMS/opennms/tree/jira/NMS-7516-foundation

Alejandro Galue June 5, 2015 at 3:12 PM

Merged into foundation on revision 297ecdb71402257b687165fecd7654d1a48a7c81.

Merged into develop on revision 297ecdb71402257b687165fecd7654d1a48a7c81.

Cherry-picked on release-16.0.1 on revision 1d9edda5ce94c9e7765f32307ca669d56bfcefe5.

Alejandro Galue June 4, 2015 at 4:25 PM

The problem was related to a variable used inside the XML Collection Handler implementations that was not thread safe. I refactored the code so that all the handlers are 100% stateless, and I added more detail to the DEBUG messages.
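For anyone reading along, the general shape of that kind of fix looks like the sketch below (this is not the actual commit; the class and method names are invented): per-collection values move out of instance fields of the shared handler and into locals/parameters, so each collector thread keeps its own copy on its own stack.

// Before: a field on a handler instance shared by all collector threads.
class StatefulHandlerSketch {
    private String resourceDir;                       // racy: visible to every thread

    void collect(int nodeId) {
        resourceDir = resolveDir(nodeId);             // may be overwritten by another thread
        store(nodeId, resourceDir);                   // may read the other thread's value
    }

    String resolveDir(int nodeId) { return String.valueOf(nodeId); }
    void store(int nodeId, String dir) { /* write JRB/RRD files under dir */ }
}

// After: no per-collection state on the instance; everything is thread-confined.
class StatelessHandlerSketch {
    void collect(int nodeId) {
        String resourceDir = resolveDir(nodeId);      // local variable, one per thread
        store(nodeId, resourceDir);
    }

    String resolveDir(int nodeId) { return String.valueOf(nodeId); }
    void store(int nodeId, String dir) { /* write JRB/RRD files under dir */ }
}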

The code has been committed to a branch off foundation called jira/NMS-7516-foundation. When Bamboo goes green, I'll merge the branch into foundation and then into develop so it is ready for 16.0.1 and 17.0.0.

For more details:

http://bamboo.internal.opennms.com:8085/browse/OPENNMS-NMS335

Alejandro Galue June 3, 2015 at 3:41 PM

I think I was able to reproduce the problem locally.

I'm going to do more tests to be sure, and try to see if I can reproduce the problem on a JUnit test.

The problem seems to be related to the XML Collector after all 😉

With just one node, the XML Collector works just fine (as I've shown you), but when I tried to collect data for the same service from different nodes (like your use case), the collector gets confused when storing the data in the JRB/RRD files.

To better illustrate what I found, I'm running with storeByGroup and storeByForeignSource enabled:

$ grep "updateRRD: updating" collectd.log

2015-06-03 12:43:37,667 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349818:105.0:21.0:210.0:315.0:420.0:525.0'
2015-06-03 12:43:37,707 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349818:105.0:21.0:210.0:315.0:420.0:525.0'
2015-06-03 12:43:37,725 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/opennms-server/sample.rrd with values '1433349818:2575.0:515.0:5150.0:7725.0:10300.0:12875.0'
2015-06-03 12:44:07,955 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349848:110.0:22.0:220.0:330.0:440.0:550.0'
2015-06-03 12:44:09,017 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349849:110.0:22.0:220.0:330.0:440.0:550.0'
2015-06-03 12:44:09,031 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349849:110.0:22.0:5160.0:330.0:10320.0:12900.0'
2015-06-03 12:44:38,960 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349879:115.0:23.0:230.0:345.0:460.0:575.0'
2015-06-03 12:44:40,046 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/opennms-server/sample.rrd with values '1433349880:120.0:24.0:5170.0:360.0:480.0:600.0'
2015-06-03 12:45:09,988 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349910:120.0:24.0:240.0:360.0:480.0:600.0'
2015-06-03 12:45:11,031 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/opennms-server/sample.rrd with values '1433349911:2590.0:518.0:5180.0:7770.0:10360.0:12950.0'
2015-06-03 12:45:11,033 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349911:125.0:25.0:250.0:375.0:500.0:625.0'
2015-06-03 12:45:41,010 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos6srv/sample.rrd with values '1433349941:125.0:25.0:250.0:375.0:500.0:625.0'
2015-06-03 12:45:42,056 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349942:130.0:519.0:260.0:390.0:520.0:650.0'
2015-06-03 12:45:42,059 updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/fs/Local/centos7srv/sample.rrd with values '1433349942:130.0:519.0:260.0:390.0:10380.0:12975.0'

I'm collecting every 30 seconds, and as you can see, in a seemingly random way, the 3 RRDs that should each be updated every 30 seconds are not being updated properly, which is not correct.

I haven't dug into the code yet, but at least I know that there is something wrong somewhere 😉

Fixed

Details

Created February 27, 2015 at 6:48 AM
Updated April 2, 2016 at 9:04 AM
Resolved June 5, 2015 at 3:12 PM