Performance degradation compared to H29
Description
Acceptance / Success Criteria
Attachments
- 29 Apr 2022, 03:37 PM
Activity
Alejandro Galue May 13, 2022 at 5:20 PM
I always set WARN when testing non-stable releases because I'm aware of the performance issues and because having DEBUG by default doesn't make sense.
Interestingly, I tried the latest RPMs from the horizon-30x branch in my Azure lab (with the log level set to WARN, as always). I'm not sure what changed, but it now works (perhaps the planetary alignment or the upcoming lunar eclipse helped). It can sustain 50K samples per second.
It seems to be working, and having the documentation changes doesn't hurt, so I'm OK with it.
fooker May 10, 2022 at 12:11 PM
This is due to different log levels in release branches. Setting the level to WARN fixes this.
Jeff Gehlbach May 5, 2022 at 8:44 PM
This problem blocks 30.0.0.
Alejandro Galue May 3, 2022 at 3:01 PM
Here are more details about the test environment I used:
https://github.com/agalue/cassandra-azure
Unfortunately, Terraform, the Azure plugin, and Ansible have all changed since I wrote that repository. As a result, the lab requires manual intervention to get it working, so I will explain what it does so that the problem can be reproduced correctly.
We need four machines with 8 cores and 32 GB of RAM each (which is why vars.tf references Standard_D8s_v5): one for OpenNMS and three for ScyllaDB or Cassandra 4 (I used ScyllaDB because it is faster, with lower latency and higher throughput than Cassandra).
Terraform is used to spin up the Resource Group, the VNET, and the VMs. It then copies the Ansible playbooks to the OpenNMS server and runs them from there. Ansible installs and configures OpenNMS, PostgreSQL, and either Cassandra or ScyllaDB. That is the part that needs manual intervention.
To manually recreate the environment without Terraform/Ansible, look at how the applications are configured via Ansible, especially OpenNMS regarding Newts (Heap Size, Ring Buffer, Resource Cache).
Once you have that in place, run the stress-metrics command and confirm that it works with H29. Repeat the process with H30, and you'll see it is slower (performance is degraded when it shouldn't be).
Speaking of versions, the lab installs H29 from yum.opennms.org. That location doesn't offer packages for "develop snapshots", so you have to either install OpenNMS manually via CloudSmith or upgrade the H29 instance and repeat the test.
Details
Assignee: fooker
Reporter: Alejandro Galue
Story Points: 8
Sprint: None
Priority: Blocker

We see severe performance issues on the latest H30 when using Newts/Cassandra.
I ran some tests and verified that H29.0.9 with 8 cores and 32 GB of RAM (16 GB for the heap) could sustain 50K samples per second against a Cassandra cluster using stress-metrics -n 2000 -i 20 -t 8. I then upgraded that VM to the latest H30 and re-ran the same stress command without changing anything else, and now it can barely exceed 33K samples per second. CPU usage is similar across both versions (slightly higher on H30), and running as root or non-root doesn't make any difference.
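To put the two figures side by side, here is a minimal sketch that computes the relative drop. It uses only the approximate rates quoted above (50K and "barely over" 33K samples per second); the function and constant names are hypothetical and not part of OpenNMS.

```python
def throughput_drop(baseline: float, measured: float) -> float:
    """Return the relative drop from the baseline as a fraction (0.25 means 25%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - measured) / baseline


# Approximate sustained rates reported in this ticket (samples per second).
H29_RATE = 50_000
H30_RATE = 33_000

drop = throughput_drop(H29_RATE, H30_RATE)
print(f"H30 sustains {H30_RATE / H29_RATE:.0%} of the H29 rate ({drop:.0%} drop)")
```

With these rounded numbers, H30 reaches roughly two thirds of the H29 throughput on the same hardware.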
With the OIA TSS layer using the noops plugin and four times more computing power, H30 hangs after 10 minutes (it becomes unusable due to severe slowness and CPU starvation), but that is most probably an unrelated, separate issue with the OIA layer.