Prometheus Collector attempting to persist non-integer values to counters

Description

The RRD files are initially created correctly, but they only ever contain NaNs (see also the attached rrdtool dump file total_idle.xml).
This behaviour is only seen with Prometheus data that has two labels, for example:

windows_cpu_time_total{core="0,0",mode="idle"}

Exporter data with at most one label is stored correctly.

Tested with windows_exporter (https://github.com/prometheus-community/windows_exporter) installed on Windows Server 2016.

To reproduce:

  • Install windows_exporter on a Windows node.

  • Use the attached datacollection configs and foreign-source configs to add the Windows node with the installed Prometheus exporter to a requisition, then synchronize.

  • Wait some time, open the node's Resource Graphs, and take a look at the graph data.

Example data from windows_exporter:

windows_terminal_services_local_session_count{session="active"} 0
windows_terminal_services_local_session_count{session="inactive"} 3
windows_terminal_services_local_session_count{session="total"} 3
windows_logical_disk_free_bytes{volume="C:"} 9.302966272e+09
windows_logical_disk_free_bytes{volume="HarddiskVolume1"} 1.54140672e+08
windows_logical_disk_size_bytes{volume="C:"} 4.2422239232e+10
windows_logical_disk_size_bytes{volume="HarddiskVolume1"} 5.23239424e+08
windows_cpu_time_total{core="0,0",mode="dpc"} 5.984375
windows_cpu_time_total{core="0,0",mode="idle"} 248335.359375
windows_cpu_time_total{core="0,0",mode="interrupt"} 15.25
windows_cpu_time_total{core="0,0",mode="privileged"} 2093
windows_cpu_time_total{core="0,0",mode="user"} 7767.921875
windows_cpu_time_total{core="0,1",mode="dpc"} 46.03125
windows_cpu_time_total{core="0,1",mode="idle"} 250622.234375
windows_cpu_time_total{core="0,1",mode="interrupt"} 119.84375
windows_cpu_time_total{core="0,1",mode="privileged"} 2224.484375
windows_cpu_time_total{core="0,1",mode="user"} 5349.40625
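
To see which of the samples above carry fractional values, here is a minimal Python sketch that parses a few of these lines. The regular expressions and the helper names (parse_sample, is_integral) are assumptions made for illustration. They cover only the simple line shape shown above and are not how the OpenNMS Prometheus collector parses metrics.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One label pair such as core="0,0"; quoted values may contain commas.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value) for one exposition-format sample line."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


def is_integral(value):
    """True if the float has no fractional part (9.302966272e+09 counts)."""
    return value.is_integer()


if __name__ == "__main__":
    samples = [
        'windows_terminal_services_local_session_count{session="inactive"} 3',
        'windows_logical_disk_free_bytes{volume="C:"} 9.302966272e+09',
        'windows_cpu_time_total{core="0,0",mode="idle"} 248335.359375',
        'windows_cpu_time_total{core="0,0",mode="privileged"} 2093',
    ]
    for line in samples:
        name, labels, value = parse_sample(line)
        kind = "integral" if is_integral(value) else "fractional"
        print(name, labels, value, kind)
```

In this sample set the single-label metrics all happen to be whole numbers, while most of the two-label windows_cpu_time_total values are fractional.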

No errors were found in the collectd debug logs:

2020-11-16 09:34:08,475 DEBUG [Collectd-Thread-9-of-150] o.o.n.c.s.AbstractCollectionAttribute: Visiting attribute Attribute[total_idle:250299.53125]
2020-11-16 09:34:08,476 DEBUG [Collectd-Thread-9-of-150] o.o.n.c.a.AbstractPersister: Persisting Attribute[total_idle:250299.53125]
2020-11-16 09:34:08,476 DEBUG [Collectd-Thread-9-of-150] o.o.n.c.a.AbstractPersister: Storing attribute Attribute[total_idle:250299.53125]
2020-11-16 09:34:08,476 INFO [Collectd-Thread-9-of-150] o.o.n.r.RrdMetaDataUtils: createMetaDataFile: creating meta data file /var/lib/opennms/rrd/snmp/fs/01-Test/1605513515011/winExporterCPU/0_0/total_idle.meta with values '{GROUP=windows_exporter_cpu_time, total_idle=total_idle}'
2020-11-16 09:34:08,476 DEBUG [Collectd-Thread-9-of-150] o.o.n.r.r.MultithreadedJniRrdStrategy: createDefinition: filename [/var/lib/opennms/rrd/snmp/fs/01-Test/1605513515011/winExporterCPU/0_0/total_idle.rrd] already exists returning null as definition
2020-11-16 09:34:08,476 DEBUG [Collectd-Thread-9-of-150] o.o.n.r.r.MultithreadedJniRrdStrategy: createRRD: skipping RRD file
2020-11-16 09:34:08,476 INFO [Collectd-Thread-9-of-150] o.o.n.c.p.r.RrdPersistOperationBuilder: updateRRD: updating RRD file /var/lib/opennms/rrd/snmp/fs/01-Test/1605513515011/winExporterCPU/0_0/total_idle.rrd with values '1605515648:250299.53125'
2020-11-16 09:34:08,476 DEBUG [Collectd-Thread-9-of-150] o.o.n.c.p.r.RrdPersistOperationBuilder: updateRRD: RRD update command completed.
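
As a cross-check of the update string in the log above, this small Python sketch splits a 'timestamp:value' pair and decodes the epoch timestamp. parse_update is a hypothetical helper written for this note, not an OpenNMS or rrdtool function.

```python
from datetime import datetime, timezone


def parse_update(arg):
    """Split a 'timestamp:value' update string into (UTC datetime, float)."""
    ts_text, value_text = arg.split(":", 1)
    when = datetime.fromtimestamp(int(ts_text), tz=timezone.utc)
    return when, float(value_text)


if __name__ == "__main__":
    when, value = parse_update("1605515648:250299.53125")
    # The value handed to the RRD update is fractional, not a whole number.
    print(when.isoformat(), value, value.is_integer())
```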

The RRD file exists and gets updated:

ls -al /var/lib/opennms/rrd/snmp/fs/01-Test/1605513515011/winExporterCPU/0_0/total_idle.rrd
-rw-r--r-- 1 root root 38232 Nov 16 09:34 /var/lib/opennms/rrd/snmp/fs/01-Test/1605513515011/winExporterCPU/0_0/total_idle.rrd

And the graph data from Resource Graphs shows:

Mon Nov 16 09:40:00 2020 NaN
Mon Nov 16 09:35:00 2020 NaN
Mon Nov 16 09:30:00 2020 NaN
Mon Nov 16 09:25:00 2020 NaN
Mon Nov 16 09:20:00 2020 NaN
Mon Nov 16 09:15:00 2020 NaN
Mon Nov 16 09:10:00 2020 NaN
Mon Nov 16 09:05:00 2020 NaN
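
For longer exports, a short Python sketch like the following can report how much of the graph data is NaN. It assumes rows in exactly the 'weekday month day time year value' shape shown above and English day and month names. parse_row and nan_ratio are hypothetical helper names.

```python
import math
from datetime import datetime


def parse_row(line):
    """Split 'Mon Nov 16 09:40:00 2020 NaN' into (datetime, float)."""
    stamp, value_text = line.strip().rsplit(" ", 1)
    # %a and %b are locale dependent; this assumes English names.
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return when, float(value_text)  # float("NaN") yields nan


def nan_ratio(lines):
    """Fraction of non-empty rows whose value is NaN (0.0 for no rows)."""
    rows = [parse_row(line) for line in lines if line.strip()]
    if not rows:
        return 0.0
    missing = sum(1 for _, value in rows if math.isnan(value))
    return missing / len(rows)


if __name__ == "__main__":
    export = [
        "Mon Nov 16 09:40:00 2020 NaN",
        "Mon Nov 16 09:35:00 2020 NaN",
        "Mon Nov 16 09:30:00 2020 NaN",
    ]
    print(f"{nan_ratio(export):.0%} of rows are NaN")
```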

Please see also the attached rrdtool dump.

Acceptance / Success Criteria

None

Attachments

4

Activity

Dino Yancey December 3, 2020 at 9:31 PM

Merged

Dino Yancey November 21, 2020 at 4:07 AM

The collector is trying to persist non-integer values to counter datasources. Apparently jrobin (and newts, I'm sure) are more tolerant of this than rrd-based strategies.
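
A minimal Python sketch of the kind of guard this implies, assuming the counter datasource accepts only non-negative whole numbers. coerce_for_counter is a hypothetical helper for illustration only. It is not taken from the PR linked below and may differ from the actual fix.

```python
import math


def coerce_for_counter(value):
    """Return an int usable by an integer-only counter, or None if unusable.

    Assumption: counters must be non-negative whole numbers, so fractional
    samples are floored, and NaN, infinite, or negative samples are skipped.
    """
    if math.isnan(value) or math.isinf(value) or value < 0:
        return None
    return math.floor(value)


if __name__ == "__main__":
    for sample in (250299.53125, 2093.0, float("nan")):
        print(sample, "->", coerce_for_counter(sample))
```

Flooring discards the fractional part of each sample. Persisting such metrics to a gauge-type datasource would be another option, with different graphing semantics.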

 

PR: https://github.com/OpenNMS/opennms/pull/3222

Dino Yancey November 20, 2020 at 5:18 PM

Curiouser and curiouser. Something with the rrd strategy seems to be silently failing when trying to write the sample data, for a majority of the samples. I left a collection running for a few days on a 30 second interval and when I revisited this issue it contained data, but with very inconsistent timestamps:

Dino Yancey November 18, 2020 at 7:53 PM
Edited

* edited after further research *

Your collection works for me on 27.0.0 if I use JRobinRrdStrategy / jrobin, but nothing persists if I use MultithreadedJniRrdStrategy / rrd.

Not sure why this is, but it's a useful data point.

 

Martin Lärcher November 16, 2020 at 10:13 AM

The same behaviour occurs with node-exporter and the samples from here:

https://opennms.discourse.group/t/enhancing-prometheus-node-exporter-data-collections/1404/3

 

Fixed

Created November 16, 2020 at 9:13 AM
Updated December 17, 2020 at 3:20 PM
Resolved December 3, 2020 at 9:31 PM