Fixed

Details
Assignee: Dmitri Herdt
Reporter: Alejandro Galue
HB Backlog Status: Reviewed
Components:
Sprint: None
Fix versions:
Affects versions:
Priority: Major
Created November 24, 2020 at 10:36 PM
Updated February 7, 2023 at 2:22 PM
Resolved February 2, 2023 at 6:32 PM
A customer opened multiple support tickets for different, apparently unrelated problems affecting one of their environments.
As soon as the ES Forwarder problem was solved (as reported in ), all of the other problems resolved themselves.
I don't think this is just a coincidence: when the ES Forwarder was unable to push content to Elasticsearch, regardless of the reason, it affected multiple OpenNMS features at once.
Among the problems the customer observed and reported while the ES Forwarder was blocked were:
1) The Poller was unable to complete all polling cycles.
2) The Collector was unable to complete all collection cycles.
3) There were random, constant gaps in the graphs for all collected metrics, even though the ScyllaDB cluster was working fine and as fast as usual.
4) All the Minions, without exception, constantly reported multiple short outages for the Heartbeat and RPC services.
5) There was a huge consumer lag on the Syslog Sink topic, preventing OpenNMS from consuming the messages sent by the Minions through Kafka.
6) The metric injection rate into Newts was fluctuating (besides the interruptions mentioned in item 3), deviating from the steady rate of about 4K samples per second this customer sees under normal conditions.
Perhaps there were more, but the above shows that whatever state the ES Forwarder entered prevented other, unrelated features from working properly. There should be some defensive code in place to prevent this from happening in the future.
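As one possible shape for that defensive code, here is a minimal sketch (in Java, not the actual OpenNMS implementation) of a forwarder that never blocks its callers: outgoing documents go into a bounded in-memory queue drained by a dedicated worker thread, and when Elasticsearch is slow or unreachable the queue fills up and new documents are dropped and counted instead of stalling pollers, collectors, or Sink consumers. The BoundedAsyncForwarder and Sink names are hypothetical.

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch of a non-blocking forwarder: documents go into a bounded
 * queue drained by a dedicated worker thread, so a slow or unreachable
 * Elasticsearch only affects this worker, never the threads producing the data.
 */
public class BoundedAsyncForwarder<T> implements AutoCloseable {

    /** Assumed downstream sink, e.g. a thin wrapper around the Elasticsearch client. */
    public interface Sink<T> {
        void send(T payload) throws Exception;
    }

    private final BlockingQueue<T> queue;
    private final Sink<T> sink;
    private final Thread worker;
    private final AtomicLong dropped = new AtomicLong();
    private volatile boolean running = true;

    public BoundedAsyncForwarder(Sink<T> sink, int capacity) {
        this.sink = sink;
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.worker = new Thread(this::drain, "es-forwarder-worker");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Never blocks the caller: when the queue is full, drop and count instead of stalling. */
    public boolean forward(T payload) {
        boolean accepted = queue.offer(payload);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    public long getDroppedCount() {
        return dropped.get();
    }

    private void drain() {
        while (running || !queue.isEmpty()) {
            try {
                T payload = queue.poll(250, TimeUnit.MILLISECONDS);
                if (payload != null) {
                    sink.send(payload);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (Exception e) {
                // A failed push only costs this one document; a retry or
                // disk-spool policy could be plugged in here instead.
                dropped.incrementAndGet();
            }
        }
    }

    @Override
    public void close() throws InterruptedException {
        running = false;
        worker.join(5_000);
    }
}
{code}

Dropping documents is obviously a trade-off; the point of the sketch is that back-pressure from Elasticsearch should surface as a metric (the dropped counter) rather than as blocked threads shared with unrelated features.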