Fixed
Details
Assignee
Alejandro Galue
Reporter
Alejandro Galue
Components
Fix versions
Affects versions
Priority
Major
Created April 13, 2016 at 9:29 AM
Updated April 19, 2016 at 8:35 PM
Resolved April 19, 2016 at 4:37 PM
Based on my first experience with Newts at a very large scale, I think it would be extremely useful to have an evaluation strategy for the persistence layer.
This evaluation layer would help answer the questions we need resolved in order to decide which strategy better fits a particular installation (i.e. RRDtool or Newts) and, of course, how to size the solution properly, especially for Newts.
In almost all the deployments I have had the opportunity to work on, the inventory of devices is well known.
On big installations, this inventory already exists in a CMDB (or something similar), which is used to feed the requisitions in OpenNMS. The customers know how many nodes they have, but before sizing the monitoring system it is impossible to know how many collectable resources they have, how many metrics they expect to collect, how many threads are required for polling and collecting, how fast the disks have to be to support the I/O load for RRDtool, or how many servers are required for Cassandra.
With this in mind, we can get that inventory up and running very quickly, but we have to choose a persistence strategy from the beginning, and a wrong choice could make OpenNMS unusable as soon as you start it.
For this reason, we can have this evaluation strategy, which will let Collectd and Pollerd do their job and, at the same time, collect statistics such as metrics per second, resources per node, average metrics per node, Top-N nodes with the most metrics, Top-N metrics, etc.
After a certain period of time, we will have a very useful set of statistics. When you stop OpenNMS, this persistence layer can generate a report (or you can trigger it on demand through a specific event if you want), and we will finally have something deterministic for deciding how to size the persistence layer properly, without guessing.
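To make the report idea concrete, here is a minimal sketch in Python of how such statistics could be derived from what the evaluation layer observes. Everything in it is an assumption for illustration: the `build_report` function, the `(node, resource, metric)` sample tuples, and the choice to count a "metric" as a distinct resource/metric pair are hypothetical and are not part of OpenNMS.

```python
from collections import Counter


def build_report(samples, elapsed_seconds, top_n=3):
    """Summarize observed samples (hypothetical helper, not an OpenNMS API).

    samples: iterable of (node, resource, metric) tuples, one per collected value.
    elapsed_seconds: length of the evaluation window.
    """
    total_samples = 0
    resources_per_node = {}   # node -> set of resource names
    series_per_node = {}      # node -> set of (resource, metric) pairs
    metric_samples = Counter()  # metric name -> number of samples seen

    for node, resource, metric in samples:
        total_samples += 1
        resources_per_node.setdefault(node, set()).add(resource)
        series_per_node.setdefault(node, set()).add((resource, metric))
        metric_samples[metric] += 1

    # Assumption: "metrics per node" means distinct resource/metric pairs.
    metrics_per_node = Counter({n: len(s) for n, s in series_per_node.items()})
    node_count = len(metrics_per_node)
    total_metrics = sum(metrics_per_node.values())

    return {
        "metrics_per_second": total_samples / elapsed_seconds if elapsed_seconds > 0 else 0.0,
        "resources_per_node": {n: len(r) for n, r in resources_per_node.items()},
        "average_metrics_per_node": total_metrics / node_count if node_count else 0.0,
        "top_nodes": metrics_per_node.most_common(top_n),
        "top_metrics": metric_samples.most_common(top_n),
    }
```

A report like this could be produced at shutdown or when the on-demand event arrives; the numbers it yields (especially metrics per second) are the inputs needed for sizing Cassandra or the RRDtool disk layout.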
Initially, I was thinking of doing this for Collectd, but it probably makes sense for Pollerd as well.
The idea is to let Collectd do its job normally; this "very thin" layer will then just count things and update a shared object with the statistics. That way we will be able to use ILR to figure out whether the number of threads is correct (without those stats being distorted by a wrong choice of persistence strategy at the beginning). At the same time, we will have real data about the metrics that we would be trying to store on the real persistence layer, without crashing OpenNMS. Let's be honest with ourselves: there is no way to know how many Cassandra nodes are needed and how they should be configured without knowing how many metrics per second you're planning to store, and something similar is true for the disk layout if you go with RRDtool.
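As a rough illustration of the "very thin" layer, the Python sketch below shows a persister that writes nothing and only updates a shared, thread-safe statistics object. The class names `EvaluationStats` and `CountingPersister` and the `persist(node, values)` signature are hypothetical; they do not reflect the actual OpenNMS persistence interfaces, which are written in Java.

```python
import threading
import time
from collections import Counter


class EvaluationStats:
    """Shared object updated by all collector threads (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._samples_per_node = Counter()
        self._total_samples = 0
        self._started = time.monotonic()

    def record(self, node, count):
        # The lock keeps the two counters consistent across threads.
        with self._lock:
            self._samples_per_node[node] += count
            self._total_samples += count

    def snapshot(self):
        # Return a copy so callers can read it without holding the lock.
        with self._lock:
            elapsed = time.monotonic() - self._started
            return {
                "total_samples": self._total_samples,
                "samples_per_node": dict(self._samples_per_node),
                "samples_per_second": self._total_samples / elapsed if elapsed > 0 else 0.0,
            }


class CountingPersister:
    """Stands in for the real persister: counts values, stores nothing."""

    def __init__(self, stats):
        self._stats = stats

    def persist(self, node, values):
        # values: mapping of metric name -> collected value (assumed shape).
        self._stats.record(node, len(values))
```

Because the persister does no I/O, collection timing is not affected by a slow or undersized storage backend, which is what keeps thread-tuning measurements meaningful during the evaluation period.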