Details
Assignee
UnassignedUnassignedReporter
Dino YanceyDino YanceyLabels
HB Grooming Date
Aug 30, 2022HB Backlog Status
BacklogComponents
Affects versions
Priority
Minor
Details
Details
Assignee
Unassigned
UnassignedReporter
Dino Yancey
Dino YanceyLabels
HB Grooming Date
Aug 30, 2022
HB Backlog Status
Backlog
Components
Affects versions
Priority
PagerDuty
PagerDuty
PagerDuty
Created August 17, 2022 at 6:22 PM
Updated June 1, 2023 at 3:23 PM
At a customer site, we encountered unexplained persistence issues with some, but not all, data collections. Investigation lead us to the following exception from {{queued}}:
2022-08-17 12:43:18,931 DEBUG [QueuingRrdStrategy-2] queued: Error closing rrd Filename: /opt/opennms/share/rrd/snmp/4651/JBoss-Memory/0HeapMemUsageMax.rrd Arguments: [1660752637:4294967296, 1660752637:4294967296, 1660752701:4294967296, 1660752701:4294967296, 1660752762:4294967296, 1660752762:4294967296, 1660752822:4294967296, 1660752822:4294967296, 1660752883:4294967296, 1660752883:4294967296, 1660752944:4294967296, 1660752944:4294967296, 1660753005:4294967296, 1660753005:4294967296, 1660753067:4294967296, 1660753067:4294967296, 1660753127:4294967296, 1660753127:4294967296, 1660753189:4294967296, 1660753189:4294967296, 1660753249:4294967296, 1660753249:4294967296, 1660753311:4294967296, 1660753311:4294967296, 1660753371:4294967296, 1660753371:4294967296, 1660753432:4294967296, 1660753432:4294967296, 1660753493:4294967296, 1660753493:4294967296, 1660753553:4294967296, 1660753553:4294967296, 1660753614:4294967296, 1660753614:4294967296, 1660753675:4294967296, 1660753675:4294967296, 1660753736:4294967296, 1660753736:4294967296, 1660753798:4294967296, 1660753798:4294967296, 1660753859:4294967296, 1660753859:4294967296, 1660753922:4294967296, 1660753922:4294967296, 1660753983:4294967296, 1660753983:4294967296, 1660754044:4294967296, 1660754044:4294967296, 1660754105:4294967296, 1660754105:4294967296, 1660754166:4294967296, 1660754166:4294967296, 1660754226:4294967296, 1660754226:4294967296, 1660754287:4294967296, 1660754287:4294967296, 1660754347:4294967296, 1660754347:4294967296, 1660754408:4294967296, 1660754408:4294967296, 1660754469:4294967296, 1660754469:4294967296, 1660754530:4294967296, 1660754530:4294967296, 1660754591:4294967296, 1660754591:4294967296]: /opt/opennms/share/rrd/snmp/4651/JBoss-Memory/0HeapMemUsageMax.rrd: illegal attempt to update using time 1660752637 when last update time is 1660752637 (minimum one second step) org.opennms.netmgt.rrd.jrrd2.api.JRrd2Exception: /opt/opennms/share/rrd/snmp/4651/JBoss-Memory/0HeapMemUsageMax.rrd: illegal attempt to update using time 1660752637 when last update time is 1660752637 (minimum one second step) at org.opennms.netmgt.rrd.jrrd2.impl.Interface.rrd_update_r(Native Method) ~[jrrd2.jar:?] at org.opennms.netmgt.rrd.jrrd2.impl.JRrd2Jni.update(JRrd2Jni.java:74) ~[jrrd2.jar:?] at org.opennms.netmgt.rrd.rrdtool.MultithreadedJniRrdStrategy$UpdateCommand.execute(MultithreadedJniRrdStrategy.java:94) ~[opennms-rrdtool-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.rrdtool.MultithreadedJniRrdStrategy.closeFile(MultithreadedJniRrdStrategy.java:131) ~[opennms-rrdtool-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.rrdtool.MultithreadedJniRrdStrategy.closeFile(MultithreadedJniRrdStrategy.java:51) ~[opennms-rrdtool-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.QueuingRrdStrategy.processClose(QueuingRrdStrategy.java:1136) [opennms-rrd-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.QueuingRrdStrategy$1.run(QueuingRrdStrategy.java:1124) [opennms-rrd-api-29.0.5.jar:?] at org.opennms.core.logging.Logging.withPrefix(Logging.java:71) [org.opennms.core.logging-29.0.5.jar:?] at org.opennms.netmgt.rrd.QueuingRrdStrategy.processPendingOperations(QueuingRrdStrategy.java:1096) [opennms-rrd-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.QueuingRrdStrategy.run(QueuingRrdStrategy.java:1069) [opennms-rrd-api-29.0.5.jar:?] at java.lang.Thread.run(Thread.java:834) [?:?]
The exception shows duplicate updates in the same second for this RRD. The root cause of the issue was duplication of datacollection groups within the datacollection config, but Queued should handle this situation more gracefully, perhaps by dropping (but logging) duplicate update
epoch:sample
pairs for the same RRD DS.Again, the root cause of this issue was not queued, but I think improvements to queued could have prevented metric data loss in this situation.
Queued should also log this exception at a level higher the DEBUG. If it's failing updates, that's clearly ERROR.