Implement org.opennms.timeseries.strategy=evaluate to facilitate the sizing process

Description

Based on my first experience with Newts at a very large scale, I think it would be extremely useful to have an evaluation strategy for the persistence layer.

This evaluation layer will help answer the questions we need to resolve in order to decide which strategy better fits a particular installation (i.e. RRDtool or Newts), and of course how to properly size the solution, especially for Newts.

In almost all the deployments I have had the opportunity to work on, the inventory of devices is well known.

On big installations, this inventory already exists in a CMDB (or something similar), which is used to feed the requisitions in OpenNMS. The customers know how many nodes they have, but before sizing the monitoring system it is impossible to know how many collectable resources they have, how many metrics they expect to collect, how many threads are required for polling and collecting, how fast the disks have to be to support the I/O load for RRDtool, or how many servers are required for Cassandra.

With this in mind, we can have that inventory up and running very quickly, but because we have to choose a persistence strategy from the beginning, a wrong choice could make OpenNMS unusable as soon as you start it.

For this reason, we can have this evaluation strategy, which will let Collectd and Pollerd do their job and, at the same time, collect statistics such as metrics per second, resources per node, average metrics per node, Top-N nodes with the most metrics, Top-N metrics, etc.
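To make the kind of statistics listed above concrete, here is a minimal sketch of how they could be derived from what the collectors see. The class name EvaluationReport and its methods are hypothetical and only for illustration; they are not part of OpenNMS, and the sketch is not thread-safe.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical holder for the inventory observed during an evaluation run. */
final class EvaluationReport {
    // node id -> resource id -> distinct metric names seen on that resource
    private final Map<String, Map<String, Set<String>>> inventory = new HashMap<>();

    /** Registers one collected metric; repeated samples of the same metric count once. */
    void observe(String nodeId, String resourceId, String metric) {
        inventory.computeIfAbsent(nodeId, k -> new HashMap<>())
                 .computeIfAbsent(resourceId, k -> new HashSet<>())
                 .add(metric);
    }

    /** Number of distinct resources seen on a node. */
    int resourcesOn(String nodeId) {
        return inventory.getOrDefault(nodeId, Collections.emptyMap()).size();
    }

    /** Number of distinct metrics seen on a node, summed over its resources. */
    int metricsOn(String nodeId) {
        return inventory.getOrDefault(nodeId, Collections.emptyMap())
                .values().stream().mapToInt(Set::size).sum();
    }

    /** Average number of distinct metrics per node, or 0.0 if nothing was observed. */
    double averageMetricsPerNode() {
        return inventory.keySet().stream().mapToInt(this::metricsOn).average().orElse(0.0);
    }

    /** Top-N nodes by number of distinct metrics, largest first, ties broken by node id. */
    List<String> topNodes(int n) {
        return inventory.keySet().stream()
                .sorted((a, b) -> {
                    int byMetrics = Integer.compare(metricsOn(b), metricsOn(a));
                    return byMetrics != 0 ? byMetrics : a.compareTo(b);
                })
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Calling observe(...) once per collected attribute and reading topNodes(10) at report time would give the "Top-N nodes" list; a real implementation would also need to be safe for concurrent collector threads.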

After a certain period of time, we will have a very useful set of statistics. When you stop OpenNMS, this persistence layer can generate a report (or you can trigger it on demand through a specific event), and we will finally have something deterministic for deciding how to properly dimension the persistence layer without guessing.

Initially, I was thinking of doing this for Collectd, but it would probably make sense for Pollerd as well.

The idea is to let Collectd do its job normally, while this "very thin" layer just counts things and updates a shared object with the statistics. That way we will be able to use ILR to figure out whether the number of threads is correct (without those stats being distorted by a wrong choice of persistence strategy at the beginning). At the same time, we will have real data about the metrics that we would be trying to store on the real persistence layer, without crashing OpenNMS (because, let's be honest with ourselves, there is no way to know how many Cassandra nodes are needed and how they should be configured without knowing how many metrics per second you're planning to store, and something similar is true for the disk layout if you go with RRDtool).
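As a rough illustration of the "very thin" layer, the sketch below shows a shared object that many collector threads could update concurrently without storing anything. EvaluationCounters, persist(...) and the other names are assumptions made for this example, not the actual OpenNMS persister interface.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical shared statistics object updated by all collector threads. */
final class EvaluationCounters {
    // LongAdder keeps contention low when many threads increment at once
    private final LongAdder samples = new LongAdder();
    // Concurrent set of the distinct resource ids seen so far
    private final Set<String> resources = ConcurrentHashMap.newKeySet();
    private final long startNanos = System.nanoTime();

    /** Called in place of a real write: nothing is stored, only counted. */
    void persist(String resourceId, int numericAttributes) {
        resources.add(resourceId);
        samples.add(numericAttributes);
    }

    long totalSamples() {
        return samples.sum();
    }

    int distinctResources() {
        return resources.size();
    }

    /** Average samples per second over the given elapsed time. */
    static double rate(long sampleCount, long elapsedNanos) {
        return elapsedNanos <= 0 ? 0.0 : sampleCount * 1_000_000_000.0 / elapsedNanos;
    }

    /** Average samples per second since this object was created. */
    double samplesPerSecond() {
        return rate(samples.sum(), System.nanoTime() - startNanos);
    }
}
```

Because persist(...) does no I/O, collection timing should reflect only the cost of talking to the devices, which is what makes the thread-count analysis meaningful.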

Acceptance / Success Criteria

None

Activity

Alejandro Galue April 18, 2016 at 2:03 PM

https://github.com/OpenNMS/opennms/pull/723

Alejandro Galue April 13, 2016 at 9:40 AM

That looks amazing, I'm going to read and play with that library.

Jesse White April 13, 2016 at 9:36 AM

I'd highly recommend using metrics-core for maintaining the stats and generating the reports.

Fixed

Details
Assignee
Alejandro Galue
Reporter
Alejandro Galue
Components
Fix versions
18.0.0
Meridian-2016.1.0
Affects versions
17.1.1
Priority
Major

Created April 13, 2016 at 9:29 AM

Updated April 19, 2016 at 8:35 PM

Resolved April 19, 2016 at 4:37 PM
