Performance degradation compared to H29

Description

We see severe performance issues on the latest H30 when using Newts/Cassandra.

In my tests, H29.0.9 with 8 cores and 32 GB of RAM (16 GB for the heap) sustained 50K samples per second against a Cassandra cluster using stress-metrics -n 2000 -i 20 -t 8.

I upgraded that VM to the latest H30 and re-ran the same stress command without changing anything else, and now it can barely exceed 33K samples per second. CPU usage is similar across both versions (slightly higher on H30), and running as root or non-root makes no difference.
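
To put the two figures above in proportion, here is a minimal sketch of the arithmetic. The helper name relative_drop is hypothetical and only for illustration; the two rates are the approximate values quoted in this report, not new measurements.

```python
def relative_drop(baseline: float, observed: float) -> float:
    """Return the fraction of baseline throughput that was lost."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - observed) / baseline


# Approximate rates taken from the description (samples per second).
h29_rate = 50_000  # sustained on H29.0.9
h30_rate = 33_000  # rough ceiling observed on H30

drop = relative_drop(h29_rate, h30_rate)
print(f"H30 throughput is about {drop:.0%} below H29")  # about 34%
```

In other words, the same VM loses roughly a third of its ingest rate after the upgrade.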

With the OIA TSS layer using the noops plugin and four times the computing power, H30 hangs after 10 minutes (it becomes unusable due to severe slowness and CPU starvation), but that is most probably a separate, unrelated issue with the OIA layer.

Acceptance / Success Criteria

None

Attachments

29 Apr 2022, 03:37 PM

Linked issues

related to

NMS-8861

Admin guide lacks a chapter on logging

Activity

Alejandro Galue May 13, 2022 at 5:20 PM

I always set WARN when testing non-stable releases because I'm aware of the performance issues and because having DEBUG by default doesn't make sense.

Interestingly, I tried the latest RPMs from the horizon-30x branch on my Azure lab (with the log level set to WARN, as always). I'm not sure what changed, but now I can see it working (perhaps the planet alignment or the upcoming lunar eclipse helped). It can sustain 50K samples per second.

It seems to be working, and having the documentation changes doesn't hurt, so I'm OK with it.

fooker May 10, 2022 at 2:20 PM

PR: https://github.com/OpenNMS/opennms/pull/4678

fooker May 10, 2022 at 12:11 PM

This is due to different log levels in release branches. Setting the level to WARN fixes this.

Jeff Gehlbach May 5, 2022 at 8:44 PM

This problem blocks 30.0.0.

Alejandro Galue May 3, 2022 at 3:01 PM

Here are more details about the test environment I used:

https://github.com/agalue/cassandra-azure

Unfortunately, Terraform, the Azure plugin, and Ansible have all changed since I wrote that repository. For that reason, the lab requires manual intervention to get it working, so I will explain what it does so that the problem can be reproduced correctly.

We need four machines with 8 cores and 32 GB of RAM each (which is why vars.tf references Standard_D8s_v5): one for OpenNMS and three for ScyllaDB or Cassandra 4 (I used ScyllaDB because it is faster, with lower latency and higher throughput than Cassandra).

Terraform is used to spin up the Resource Group, the VNET, and the VMs. It then copies the Ansible playbooks to the OpenNMS server and runs them from there. Ansible installs and configures OpenNMS, PostgreSQL, and either Cassandra or ScyllaDB. That is the part that needs manual intervention.

To manually recreate the environment without Terraform/Ansible, look at how the applications are configured via Ansible, especially OpenNMS regarding Newts (Heap Size, Ring Buffer, Resource Cache).

Once you have that in place, run the stress-metrics command and confirm that it works with H29. Repeat the process with H30, and you'll see it is slower (meaning the performance is degraded, and it shouldn't be).

Speaking of versions, the lab installs H29 from yum.opennms.org. That location doesn't offer packages for "develop snapshots", so you have to either install OpenNMS manually via CloudSmith or upgrade the H29 instance and repeat the test.

Not a Bug

Details
Assignee: fooker
Reporter: Alejandro Galue
Story Points: 8
Components: Data Collection, Data Output - Newts
Sprint: None
Fix versions: 30.0.0
Affects versions: 30.0.0
Priority: Blocker

Created April 29, 2022 at 3:37 PM

Updated May 30, 2022 at 3:22 PM

Resolved May 17, 2022 at 12:08 PM