Improve performance of newts.indexing to avoid overwhelming the Cassandra cluster

Description

I've been working on running our metrics:stress tool against Cassandra clusters in order to understand how well the cluster behaves, and I found a problem associated with something called "Newts Index Inserts".

This feature can be controlled through an undocumented setting called org.opennms.newts.disable.indexing. By default, indexing is enabled.
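
For reference, this is how the flag would be set in opennms.properties (a minimal sketch; the property is undocumented, so treating "true" as the value that disables indexing is an assumption based on its name):

    # opennms.properties (hedged example of the undocumented flag)
    org.opennms.newts.disable.indexing=true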

This feature is required in order to be able to enumerate the available resources and metrics, which translates primarily into populating the newts.resource_metrics table on Cassandra.
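
As a quick sanity check, the index can be inspected directly from cqlsh (a minimal sketch; only the keyspace and table name are taken from this ticket, no further schema details are assumed):

    -- cqlsh: confirm the Newts index table is being populated
    SELECT * FROM newts.resource_metrics LIMIT 10;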

This indexing process happens every time OpenNMS is started, and it consumes a considerable amount of resources, especially on Cassandra, making the cluster temporarily unavailable (as it is extremely busy). The side effect in OpenNMS is that the ring buffer (regardless of the configured size) fills to its maximum and stays there for a while, especially under heavy load (100K samples per second or higher). While the ring buffer is full, OpenNMS discards samples, which translates into holes in the graphs.

In my tests, using 4 `m4.10xlarge` EC2 instances running 3 Cassandra instances on each of them (as that is the use case I'm studying at the moment), for a total of 12 Cassandra nodes, the ring buffer stays full for 15 minutes.

Once this indexing work is done, the ring buffer drains to 0 and the Cassandra cluster starts working smoothly; it is able to handle 100K samples per second even with 2 physical nodes down (which means 6 of the 12 Cassandra instances down). From this point, if I disable indexing, I can restart OpenNMS without worrying about performance.

Now, I made another test, which was to start over with a fresh cluster and indexing disabled. I can see that the samples table is being updated, there are no issues in OpenNMS or Cassandra, and the ring buffer is barely used (checked through JMX directly). Unfortunately, because the resource_metrics table is not updated, OpenNMS cannot enumerate the resources and I cannot graph any performance metric. This is why the indexing has to be performed at least once.

The only way I found to reduce the indexing time is by brute force, which means having a more powerful Cassandra cluster, and I don't think that is the best solution. If I build the cluster using m5.12xlarge instances, the indexing finishes quickly, using 50 percent of the ring buffer while it runs (with the ring buffer sized at 2^22 = 4194304).

The idea would be to understand where the heavy load is created, in either Newts or the persistence strategy in OpenNMS, so we can avoid overwhelming the cluster: index only when necessary (not every time OpenNMS starts), and spread out those inserts so that cluster performance is not affected, since the actual metrics are being generated while the inserts happen (which is why the ring buffer grows so quickly).

Finally, I found that 2^22 is the largest ring buffer size that doesn't have a bad impact on OpenNMS performance. Larger values like 2^23 (it has to be a power of 2) have a bad impact on OpenNMS CPU usage and quickly lead to long full GCs (even though the ring buffer is designed to avoid memory issues).
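
For reference, the ring buffer size discussed above is configured in opennms.properties (a minimal sketch using the standard org.opennms.newts.config.ring_buffer_size property; the value must be a power of 2):

    # opennms.properties
    # 2^22 = 4194304; 2^23 caused high CPU usage and long full GCs in these tests
    org.opennms.newts.config.ring_buffer_size=4194304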

 

Acceptance / Success Criteria

None


Activity

Jesse White May 14, 2018 at 12:35 AM

Enabled by default in foundation-2018 with df0b845fad08801afca6bd8a16f1f570faaf7cba.

Alejandro Galue May 8, 2018 at 8:21 PM

With the cache primer manually enabled, it works as expected: when I restarted, the resource cache filled up much more quickly, and the ring buffer did not overflow.

On an interesting note, the resource cache graph was already filled up before I saw the confirmation in the logs that the priming had finished.

Alejandro Galue May 8, 2018 at 5:15 PM

Makes sense, thanks! I'll try it again soon.

Jesse White May 8, 2018 at 5:12 PM

I chose not to enable it by default in foundation-2016 in order to maintain the existing behavior. You can enable it with org.opennms.newts.config.cache.priming.enable=true.

In foundation-2018 it will default to being enabled though.
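
For reference, a minimal opennms.properties sketch based on this thread (the enable flag is quoted above; the full block_ms property name and the 120000 ms value are assumptions derived from the "priming.block_ms" shorthand and the 2 min blocking behavior discussed in the comments below):

    # opennms.properties
    # Prime the Newts resource cache on startup.
    org.opennms.newts.config.cache.priming.enable=true
    # Assumed: block startup for up to 2 minutes (120000 ms) while priming runs.
    org.opennms.newts.config.cache.priming.block_ms=120000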

Alejandro Galue May 8, 2018 at 2:53 PM

I've created a new AMI with the RPMs from the branch associated with PR 1943:

It doesn't look like it is using priming.block_ms. I didn't specify it in opennms.properties, expecting that after restarting OpenNMS it would block for 2 min (which didn't happen), and the slope of the resource cache graph after the restart was the same as before, meaning the cache is not growing fast enough and indexing is happening again (and the ring buffer reaches its maximum like the first time).

I enabled DEBUG for eventd and restarted one more time, but nothing related to Newts-CachePrimer was shown. I've added the following to opennms.properties and restarted one more time:

Nothing was shown in eventd.log.

Thoughts?

Fixed

Details

Created April 17, 2018 at 7:17 PM
Updated May 21, 2018 at 2:47 PM
Resolved May 14, 2018 at 12:37 AM