A Karaf issue with SSH sessions makes the metrics:stress command unusable
Description
Activity

Christian Pape April 16, 2018 at 2:15 PM
Yes, I agree. This seems to have another cause. I tried it again and verified that the session stays open longer than 15 minutes.

Alejandro Galue April 13, 2018 at 4:04 PM (edited)
I've installed the RPMs I was using yesterday on the big EC2 instance, and the SSH session still dies. If I reduce the heap and the ring buffer to what I used yesterday, the connection stays alive longer, but eventually dies.
I found that it dies while the JVM is doing a Full GC, and I believe this is due to the ServerAliveInterval setting. If I use a greater value (enough to survive the long Full GC I'm seeing), the session stays open.
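For context, the OpenSSH client actually tolerates ServerAliveInterval × ServerAliveCountMax seconds of silence before dropping the connection (ServerAliveCountMax defaults to 3), so the window to tune is the product, not the interval alone. A minimal sketch of that arithmetic; the 60/5 values are illustrative, not from this ticket:

```shell
# The OpenSSH client drops the connection after
# ServerAliveInterval * ServerAliveCountMax seconds without a server reply,
# e.g.: ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=5 -p 8101 admin@localhost
interval=60   # -o ServerAliveInterval=60 (illustrative value)
count=5       # -o ServerAliveCountMax=5  (OpenSSH default is 3)
window=$((interval * count))
echo "keepalive window: ${window}s"   # must exceed the longest Full GC pause
```

If the Full GC pauses are longer than that window, the session will still die regardless of how often keepalives are sent.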
To be honest, this seems to be a different problem, so I'll open another JIRA issue after I dig deeper into what's happening with OpenNMS on a big instance (lots of RAM and CPU).
In terms of this one, we're good.

Alejandro Galue April 13, 2018 at 3:43 PM (edited)
I don't know if it's a coincidence, but today the SSH connection fails (it dies), even though it was working fine yesterday. I've attached the full log of the transaction.
About the differences:
Yesterday I was using a medium-sized instance for OpenNMS (m4.4xlarge); today I'm using a larger instance (m4.10xlarge) with more heap and a bigger ring buffer (to emulate the customer's environment).
Besides that, yesterday I was using the RPMs from the branch associated with the PR, and today I'm using the latest RPMs from release-21.1.0 (which include the merge of this fix, as you can see in the file I've uploaded; in other words: opennms-core-21.1.0-0.20180412.onms2099.release.21.1.0.12.noarch).
Thoughts?

Alejandro Galue April 5, 2018 at 12:49 PM
On a build of develop from 3 days ago (without your changes), I tried "-o ServerAliveInterval=10", and it didn't work (as expected: I had already tried that when I initially discovered the problem, since it is one of the suggested workarounds). In my case, the session didn't die, but became unresponsive or frozen after 10 minutes (no more feedback from the metrics:stress command, which I had configured to report every 30 seconds, and no response to Ctrl+C). This is what I've used:
Interestingly, even though I get no feedback and cannot revive the frozen SSH session, I can see that the RRDs are still being updated as if the command were still running (at least for more than 10 minutes after the SSH session became unresponsive). Without ServerAliveInterval, the command stops after 10 minutes because the SSH session has died.
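One way to confirm server-side that the command is still running while the client hangs is to look for recently written RRD/JRB files. A small sketch; the /opt/opennms/share/rrd path in the usage note is the default OpenNMS storage location and may differ in your install:

```shell
# Print RRD/JRB files modified within the last minute under the given
# directory; non-empty output means the collector is still writing
# even though the SSH client appears frozen.
recent_rrds() {
  find "$1" \( -name '*.rrd' -o -name '*.jrb' \) -mmin -1
}
```

Usage would be something like `recent_rrds /opt/opennms/share/rrd`, repeated every minute or so while the session is hung.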
This is what I experienced at the customer's site, and the reason why I opened this issue.
Now, I've compiled your branch and executed the metrics:stress with the ServerAliveInterval flag:
So I was able to verify that the SSH session didn't die: even though we have to specify the ServerAliveInterval flag, it does the trick, and we can use the stress tool without issues again.
As you've mentioned, without the ServerAliveInterval flag, the session dies even with the new Karaf.
That being said, updating Karaf certainly solved the problem, but we should update the official documentation to let users know that this flag is needed when you expect to keep an SSH session open for a long time (especially for commands where time matters, like the stress test).
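If the documentation gets that note, it might also suggest a per-host entry in ~/.ssh/config so the flag doesn't have to be typed every time. A sketch, assuming the default Karaf SSH port and admin user; the 10-second interval simply mirrors the value tried above:

```
Host karaf
    HostName localhost
    Port 8101
    User admin
    ServerAliveInterval 10
```

With this in place, `ssh karaf` picks up the keepalive setting automatically.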
Working with Drift 68, I was trying to use the metrics:stress tool to test the new Cassandra cluster a customer is building, and found that SSH sessions against Karaf die or freeze after 10 minutes.
I've added
sshIdleTimeout=0
to org.apache.karaf.shell.cfg, and even connected to Karaf using
ssh -o ServerAliveInterval=60 admin@localhost -p 8101
but neither helped. Digging on the Internet, I found the following:
https://issues.apache.org/jira/browse/KARAF-5473
It seems like we have to upgrade Karaf to be able to use the stress tool again, unless there is a workaround.
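For anyone retrying the workaround above, the sshIdleTimeout change can be scripted idempotently. A minimal sketch, assuming the standard Karaf config file name; the /opt/opennms/etc path in the comment is the default OpenNMS location and may differ:

```shell
# Set sshIdleTimeout=0 (disable idle disconnect) in a Karaf shell config
# file, replacing any existing value or appending the key if absent.
# Note: this rewrite moves the key to the end of the file.
set_idle_timeout() {
  cfg=$1
  grep -v '^sshIdleTimeout=' "$cfg" > "$cfg.tmp" || true
  echo 'sshIdleTimeout=0' >> "$cfg.tmp"
  mv "$cfg.tmp" "$cfg"
}
# e.g. set_idle_timeout /opt/opennms/etc/org.apache.karaf.shell.cfg
```

As noted in the comments above, though, this setting alone did not keep the session alive; the Karaf upgrade plus the client-side ServerAliveInterval flag is what worked.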