Expose Provisiond status through JMX
Activity

Markus von Rüden May 17, 2017 at 4:51 AM
Putting on hold for now.

Alejandro Galue May 8, 2017 at 1:48 PM
Here are some topics I'd like to touch on during our call:
Provisioning Instrumentation Requirements:
Is the importer importing a requisition? (customer requirement)
Which requisition is being imported at the moment? (customer requirement)
How many requisitions are on the import queue? (customer requirement)
Which requisitions are on the import queue? (customer requirement)
Is the size of a given thread pool enough to perform the job?
Are the thread pools working at the moment?
How much time is spent on a given requisition? If the overall number is not possible, at least have the number per phase (for example: "XXX takes 20min on average to be completely processed"; or "XXX takes on average 5min on Import, 10min on Scan, etc.")
How many errors have been found during the import process? (I mean in total across all phases, or per phase)
Pending:
Understand the internals and the motivation behind the provided stats on the PR
What can we do about the above questions
Potentially not needed:
Scheduler statistics at either node or requisition level (unless there is a purpose for this)
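As an illustration of the four customer requirements listed above, here is a minimal sketch of an MXBean that exposes only scalar and array attributes. Every name in it (ImportStatusMXBean, ImportStatus, the org.example.provisiond domain, the requisition names) is hypothetical and is not taken from Provisiond or from the PR; it only shows the general shape such a bean could take.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical sketch: none of these names come from the OpenNMS code base.
class ImportStatusSketch {

    /** Scalar and array attributes only, so generic JMX clients can read them. */
    public interface ImportStatusMXBean {
        boolean isImporting();                 // Is the importer importing a requisition?
        String getCurrentRequisition();        // Which one? Empty string when idle.
        int getQueueSize();                    // How many requisitions are queued?
        List<String> getQueuedRequisitions();  // Which requisitions are queued?
    }

    public static class ImportStatus implements ImportStatusMXBean {
        private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        private final AtomicReference<String> current = new AtomicReference<>("");

        public void enqueue(String requisition) {
            queue.add(requisition);
        }

        /** Marks the next queued requisition as in progress; stays idle if the queue is empty. */
        public void startNext() {
            String next = queue.poll();
            current.set(next == null ? "" : next);
        }

        public void finishCurrent() {
            current.set("");
        }

        @Override public boolean isImporting() { return !current.get().isEmpty(); }
        @Override public String getCurrentRequisition() { return current.get(); }
        @Override public int getQueueSize() { return queue.size(); }
        @Override public List<String> getQueuedRequisitions() { return new ArrayList<>(queue); }
    }

    public static void main(String[] args) throws Exception {
        ImportStatus status = new ImportStatus();
        status.enqueue("Routers");
        status.enqueue("Servers");
        status.startNext();

        // The ObjectName domain below is made up for this example.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("org.example.provisiond:type=ImportStatus");
        server.registerMBean(status, name);

        System.out.println("Importing: " + server.getAttribute(name, "Importing"));
        System.out.println("Current:   " + server.getAttribute(name, "CurrentRequisition"));
        System.out.println("Queued:    " + server.getAttribute(name, "QueueSize"));
    }
}
```

Keeping the attributes to booleans, numbers, strings, and a string array is a deliberate choice in this sketch: it avoids the tabular types that generic collectors struggle with.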

Alejandro Galue May 8, 2017 at 11:29 AM
If the requirements were not specific, you could have asked before creating a PR.
Because I was thinking of simple, generic metrics to help troubleshoot Provisiond issues, JMX came to mind (that's what we have for the rest of the daemons in OpenNMS, and the JMX collector can handle those stats without issues as part of the default configuration). If JMX is not the proper way to provide what is required for Provisiond, I'm open to suggestions.
I'll try to find a spot to have a call about this.

Markus von Rüden May 8, 2017 at 11:19 AM (edited)
I can see your point, but to be fair, the original issue was not very specific about what was expected.
If the requirement is not to have the data exposed via JMX, we should not merge the PR, as the implementation is very JMX specific.
Can you schedule a call where we can discuss this in more detail?

Alejandro Galue May 8, 2017 at 10:08 AM
Although I was looking forward to seeing metrics that help support and customers get an idea of how Provisiond is behaving, the actual implementation was a lot more ambitious and provides lots of statistics. That is very much appreciated, but the purpose of the Jira issue is lost in this implementation. The implemented statistics cannot be retrieved through the JMX Collector due to their tabular nature, so you have to implement other ways to extract the data. Otherwise, you won't be able to figure out how Provisiond is behaving or where you should improve things.
The only "general purpose" statistics are the thread pools [ThreadPoolExecutorStatistics].
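To illustrate what "general purpose" thread pool numbers could look like, here is a minimal sketch built only on the standard java.util.concurrent.ThreadPoolExecutor getters. The PoolSnapshot class and the isSaturated heuristic are assumptions made for this example; this is not the ThreadPoolExecutorStatistics class from the PR.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; this is not the ThreadPoolExecutorStatistics class from the PR.
class PoolGaugeSketch {

    /** Point-in-time view of one pool, built only from ThreadPoolExecutor getters. */
    static final class PoolSnapshot {
        final int active;
        final int maxSize;
        final int queued;

        PoolSnapshot(ThreadPoolExecutor pool) {
            this.active = pool.getActiveCount();
            this.maxSize = pool.getMaximumPoolSize();
            this.queued = pool.getQueue().size();
        }

        /** All threads busy and work waiting: a hint that the pool may be too small. */
        boolean isSaturated() {
            return active >= maxSize && queued > 0;
        }
    }

    /** Submits blocking tasks to a fixed-size pool and takes a snapshot while they run. */
    static PoolSnapshot snapshotUnderLoad(int threads, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(Math.min(threads, tasks));
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // every task that could get a thread is now running
            return new PoolSnapshot(pool);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        PoolSnapshot s = snapshotUnderLoad(2, 4);
        System.out.println("active=" + s.active + " max=" + s.maxSize
                + " queued=" + s.queued + " saturated=" + s.isSaturated());
    }
}
```

Three integers per pool answer both "are the thread pools working at the moment?" and, over time, "is the pool size enough?", and they fit the scalar model a JMX collector expects.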
The rest of the statistics are very low-level details. There are statistics on a per-requisition basis [RequisitionStatistics, RequisitionImportScheduleDetails] and on a per-node scheduler basis [NodeScanSchedule]. To be honest, I can't see the usefulness of the node scheduler stats (Memory Usage = 6 * fieldSize * numOfNodes), as a given installation could have tens of thousands of nodes (besides having thousands of requisitions).
On the other hand, the per-requisition statistics (Memory Usage = 5 * 6 * fieldSize * numOfRequisitions; 5 seems to be the number of phases) can be integrated with the Requisitions UI if we implement a ReST API for them, to give the user some input about, for example, how much time each phase takes to complete (although you have to sum the phases manually to figure out the total time). It is not clear how to read the provided stats.
The purpose of RequisitionImportScheduleDetails (marked as TabularData) is not clear.
There is a natural concern about these statistics: the memory impact. Certainly JMX and Dropwizard metrics have a low memory footprint, but we're talking about hundreds of thousands of metrics. Combined, they might have a significant impact, which is one reason why I would compare memory usage on systems with thousands of nodes and thousands of requisitions.
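The two memory formulas quoted in this comment can be turned into a small calculator for such a comparison. This is only a sketch: fieldSize and the installation sizes below are assumed inputs, and the real per-field cost (including the metric objects behind each field) is not known from this thread and would have to be measured.

```java
// Hypothetical sketch: fieldSize and the installation sizes are assumed inputs.
class StatsMemoryEstimate {

    static final int FIELDS_PER_ENTRY = 6; // from "6 x fieldSize x numOfNodes"
    static final int PHASES = 5;           // "5 seems to be the number of phases"

    /** Node scheduler stats: 6 * fieldSize * numOfNodes. */
    static long nodeSchedulerBytes(long fieldSize, long nodes) {
        return FIELDS_PER_ENTRY * fieldSize * nodes;
    }

    /** Per-requisition stats: 5 * 6 * fieldSize * numOfRequisitions. */
    static long requisitionBytes(long fieldSize, long requisitions) {
        return PHASES * FIELDS_PER_ENTRY * fieldSize * requisitions;
    }

    public static void main(String[] args) {
        long fieldSize = 8;        // assumption: raw bytes per field, ignoring object overhead
        long nodes = 50_000;       // assumption: "tens of thousands of nodes"
        long requisitions = 2_000; // assumption: "thousands of requisitions"

        System.out.println("node scheduler bytes: " + nodeSchedulerBytes(fieldSize, nodes));
        System.out.println("requisition bytes:    " + requisitionBytes(fieldSize, requisitions));
    }
}
```

Both results scale linearly with fieldSize, so the measured per-field cost is the number that decides whether the total matters.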
Because the current state of the implementation doesn't cover what was requested in the Jira issue, I would defer this implementation to foundation-2017. The main reason for including it in foundation-2016 was what was described in Jira, and unfortunately that doesn't exist. Another reason to defer it is the potential memory usage.
Of course, the statistics are lost if OpenNMS is restarted. These statistics change when requisitions are imported, so if there is no activity, there are no stats. In some cases, OpenNMS has to be restarted as often as weekly for several reasons. For these reasons, moving averages are not useful at all: requisition imports happen once per day, so 1min, 5min, and 15min stats are not useful. I know that's how Dropwizard provides the data, but the numbers should be transformed into something that makes sense, which makes it even more important to have a ReST endpoint or something intelligent to get, process, and present the data in a human-readable and understandable form.
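One possible alternative to moving averages, sketched here under assumptions, is to keep the duration of each phase for the most recent import and derive the total from those values. The class name and the phase names ("Import", "Scan") are illustrative only and do not come from the PR.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: phase names and class names are illustrative only.
class LastImportTimings {
    // Insertion order is kept so phases are reported in the order they ran.
    private final Map<String, Duration> lastRun = new LinkedHashMap<>();

    /** Stores (or overwrites) the duration of one phase of the most recent import. */
    synchronized void record(String phase, Duration elapsed) {
        lastRun.put(phase, elapsed);
    }

    /** Sum over all recorded phases, so nobody has to add them by hand. */
    synchronized Duration total() {
        Duration sum = Duration.ZERO;
        for (Duration d : lastRun.values()) {
            sum = sum.plus(d);
        }
        return sum;
    }

    /** One human-readable line per requisition, in whole minutes. */
    synchronized String summary(String requisition) {
        StringBuilder sb = new StringBuilder(requisition).append(':');
        for (Map.Entry<String, Duration> e : lastRun.entrySet()) {
            sb.append(' ').append(e.getKey())
              .append(' ').append(e.getValue().toMinutes()).append(" min,");
        }
        return sb.append(" total ").append(total().toMinutes()).append(" min").toString();
    }

    public static void main(String[] args) {
        LastImportTimings timings = new LastImportTimings();
        timings.record("Import", Duration.ofMinutes(5));
        timings.record("Scan", Duration.ofMinutes(10));
        System.out.println(timings.summary("XXX"));
    }
}
```

A value that describes the last run stays meaningful for an event that happens once a day, whereas a 15-minute rate decays to zero long before anyone looks at it.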
I'd like to have an answer about the possibility of implementing what was requested in this Jira issue in the first place. I'm asking because the PR "as it is" can be merged into the code base, but I think it should not be taken as a solution to this particular Jira issue.
As a final note, I spent some time writing a small JSP as a POC to show the RequisitionsStats as HTML and as JSON. The data is provided very quickly, at least orders of magnitude faster than JMC (the only tool I found to see the data without code changes). This can be used to add a ReST endpoint to integrate the requisition stats with the UI. This POC can also be used to measure the size of the JMX data, and from there we can estimate the size of the Dropwizard objects and other related classes that could influence memory usage.
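The JSP itself is not attached to this issue. As a rough idea of the kind of conversion such a POC needs, here is a minimal sketch that flattens a JMX TabularData value into JSON using only the standard javax.management.openmbean classes. The row layout ("requisition", "importMillis") is invented for the example and is not the structure exposed by the PR.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.openmbean.CompositeData;
import javax.management.openmbean.CompositeDataSupport;
import javax.management.openmbean.CompositeType;
import javax.management.openmbean.OpenDataException;
import javax.management.openmbean.OpenType;
import javax.management.openmbean.SimpleType;
import javax.management.openmbean.TabularData;
import javax.management.openmbean.TabularDataSupport;
import javax.management.openmbean.TabularType;

// Hypothetical sketch: the row layout is invented; it is not the PR's RequisitionStatistics.
class TabularToJson {

    /**
     * Renders each row as a JSON object. Keys follow the CompositeType's sorted key set.
     * String values are quoted but not escaped, which is enough for a POC only.
     */
    static String toJson(TabularData table) {
        List<String> rows = new ArrayList<>();
        for (Object value : table.values()) {
            CompositeData row = (CompositeData) value;
            List<String> fields = new ArrayList<>();
            for (String key : row.getCompositeType().keySet()) {
                Object v = row.get(key);
                String rendered = (v instanceof Number) ? v.toString() : "\"" + v + "\"";
                fields.add("\"" + key + "\":" + rendered);
            }
            rows.add("{" + String.join(",", fields) + "}");
        }
        return "[" + String.join(",", rows) + "]";
    }

    /** Builds a one-row table with made-up data, indexed by requisition name. */
    static TabularData sampleTable() {
        try {
            String[] names = {"requisition", "importMillis"};
            CompositeType rowType = new CompositeType("Row", "one requisition",
                    names, names, new OpenType<?>[] {SimpleType.STRING, SimpleType.LONG});
            TabularType tableType = new TabularType("Rows", "per-requisition stats",
                    rowType, new String[] {"requisition"});
            TabularData table = new TabularDataSupport(tableType);
            table.put(new CompositeDataSupport(rowType, names,
                    new Object[] {"Routers", 300000L}));
            return table;
        } catch (OpenDataException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toJson(sampleTable()));
    }
}
```

The same traversal could back a ReST endpoint, and the length of the resulting string gives a first rough measure of how much data the tabular attributes carry.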
Details
Assignee: Markus von Rüden (Deactivated)
Reporter: Alejandro Galue
Doc Backlog Status: NB
Doc Backlog Grooming Date: Jul 28, 2021
Sprint: None
Priority: Major

Description
It is important to know the status of the different components and threads that are part of Provisiond. This can be useful for debugging issues and understanding what Provisiond is doing at a given time.
I think we should expose:
Status of the importer.
Status of the import queue.
Content of the import queue.
Status of the import threads.
Status of the scan threads.
Status of the rescan threads.
Status of the writer threads.
Provisioning adapters enabled.