Queued should ignore duplicate updates in a single operation

Description

At a customer site, we encountered unexplained persistence issues with some, but not all, data collections. Investigation lead us to the following exception from {{queued}}:

2022-08-17 12:43:18,931 DEBUG [QueuingRrdStrategy-2] queued: Error closing rrd Filename: /opt/opennms/share/rrd/snmp/4651/JBoss-Memory/0HeapMemUsageMax.rrd Arguments: [1660752637:4294967296, 1660752637:4294967296, 1660752701:4294967296, 1660752701:4294967296, 1660752762:4294967296, 1660752762:4294967296, 1660752822:4294967296, 1660752822:4294967296, 1660752883:4294967296, 1660752883:4294967296, 1660752944:4294967296, 1660752944:4294967296, 1660753005:4294967296, 1660753005:4294967296, 1660753067:4294967296, 1660753067:4294967296, 1660753127:4294967296, 1660753127:4294967296, 1660753189:4294967296, 1660753189:4294967296, 1660753249:4294967296, 1660753249:4294967296, 1660753311:4294967296, 1660753311:4294967296, 1660753371:4294967296, 1660753371:4294967296, 1660753432:4294967296, 1660753432:4294967296, 1660753493:4294967296, 1660753493:4294967296, 1660753553:4294967296, 1660753553:4294967296, 1660753614:4294967296, 1660753614:4294967296, 1660753675:4294967296, 1660753675:4294967296, 1660753736:4294967296, 1660753736:4294967296, 1660753798:4294967296, 1660753798:4294967296, 1660753859:4294967296, 1660753859:4294967296, 1660753922:4294967296, 1660753922:4294967296, 1660753983:4294967296, 1660753983:4294967296, 1660754044:4294967296, 1660754044:4294967296, 1660754105:4294967296, 1660754105:4294967296, 1660754166:4294967296, 1660754166:4294967296, 1660754226:4294967296, 1660754226:4294967296, 1660754287:4294967296, 1660754287:4294967296, 1660754347:4294967296, 1660754347:4294967296, 1660754408:4294967296, 1660754408:4294967296, 1660754469:4294967296, 1660754469:4294967296, 1660754530:4294967296, 1660754530:4294967296, 1660754591:4294967296, 1660754591:4294967296]: /opt/opennms/share/rrd/snmp/4651/JBoss-Memory/0HeapMemUsageMax.rrd: illegal attempt to update using time 1660752637 when last update time is 1660752637 (minimum one second step) org.opennms.netmgt.rrd.jrrd2.api.JRrd2Exception: /opt/opennms/share/rrd/snmp/4651/JBoss-Memory/0HeapMemUsageMax.rrd: illegal attempt to update using time 1660752637 when last update time is 1660752637 (minimum one second step) at org.opennms.netmgt.rrd.jrrd2.impl.Interface.rrd_update_r(Native Method) ~[jrrd2.jar:?] at org.opennms.netmgt.rrd.jrrd2.impl.JRrd2Jni.update(JRrd2Jni.java:74) ~[jrrd2.jar:?] at org.opennms.netmgt.rrd.rrdtool.MultithreadedJniRrdStrategy$UpdateCommand.execute(MultithreadedJniRrdStrategy.java:94) ~[opennms-rrdtool-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.rrdtool.MultithreadedJniRrdStrategy.closeFile(MultithreadedJniRrdStrategy.java:131) ~[opennms-rrdtool-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.rrdtool.MultithreadedJniRrdStrategy.closeFile(MultithreadedJniRrdStrategy.java:51) ~[opennms-rrdtool-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.QueuingRrdStrategy.processClose(QueuingRrdStrategy.java:1136) [opennms-rrd-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.QueuingRrdStrategy$1.run(QueuingRrdStrategy.java:1124) [opennms-rrd-api-29.0.5.jar:?] at org.opennms.core.logging.Logging.withPrefix(Logging.java:71) [org.opennms.core.logging-29.0.5.jar:?] at org.opennms.netmgt.rrd.QueuingRrdStrategy.processPendingOperations(QueuingRrdStrategy.java:1096) [opennms-rrd-api-29.0.5.jar:?] at org.opennms.netmgt.rrd.QueuingRrdStrategy.run(QueuingRrdStrategy.java:1069) [opennms-rrd-api-29.0.5.jar:?] at java.lang.Thread.run(Thread.java:834) [?:?]

The exception shows duplicate updates in the same second for this RRD. The root cause of the issue was duplication of datacollection groups within the datacollection config, but Queued should handle this situation more gracefully, perhaps by dropping (but logging) duplicate update epoch:sample pairs for the same RRD DS.

Again, the root cause of this issue was not queued, but I think improvements to queued could have prevented metric data loss in this situation.

Queued should also log this exception at a level higher the DEBUG. If it's failing updates, that's clearly ERROR.

Acceptance / Success Criteria

None

Lucidchart Diagrams

Activity

Show:

Details

Assignee

Reporter

Labels

HB Grooming Date

HB Backlog Status

Affects versions

Priority

PagerDuty

Created August 17, 2022 at 6:22 PM
Updated June 1, 2023 at 3:23 PM

Flag notifications