A Karaf issue with SSH sessions makes the metrics:stress command unusable
Description
Activity

Christian Pape April 16, 2018 at 2:15 PM
Yes, I agree. This seems to have another cause. I tried it again and verified that the session stays open longer than 15 minutes.

Alejandro Galue April 13, 2018 at 4:04 PM (edited)
I've installed the RPMs I was using yesterday on the big EC2 instance, and the SSH session still dies. If I reduce the heap and the ring buffer to what I used yesterday, the connection stays alive longer, but eventually dies.
I found that it dies while the JVM is doing a Full GC, and I believe this is due to the ServerAliveInterval setting. If I use a greater value (enough to survive the long Full GC I'm seeing), the session stays open.
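For context, the OpenSSH client actually tolerates ServerAliveInterval × ServerAliveCountMax seconds of silence before dropping the connection (ServerAliveCountMax defaults to 3), so the window to tune is the product, not the interval alone. A minimal sketch of that arithmetic; the 60/5 values are illustrative, not from this ticket:

```shell
# The OpenSSH client drops the connection after
# ServerAliveInterval * ServerAliveCountMax seconds without a server reply,
# e.g.: ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=5 -p 8101 admin@localhost
interval=60   # -o ServerAliveInterval=60 (illustrative value)
count=5       # -o ServerAliveCountMax=5  (OpenSSH default is 3)
window=$((interval * count))
echo "keepalive window: ${window}s"   # must exceed the longest Full GC pause
```

If the Full GC pauses are longer than that window, the session will still die regardless of how often keepalives are sent.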
To be honest, this seems to be a different problem, so I'll open another JIRA issue after I dig deeper into what's happening with OpenNMS on a big instance (lots of RAM and CPU).
In terms of this one, we're good.

Alejandro Galue April 13, 2018 at 3:43 PM (edited)
I don't know if it's a coincidence, but today the SSH connection fails (it dies), even though it was working fine yesterday. I've attached the full log of the transaction.
About the differences:
Yesterday I was using a medium-sized instance for OpenNMS (m4.4xlarge); today I'm using a larger instance (m4.10xlarge) with more heap and a bigger ring buffer (to emulate the customer's environment).
Besides that, yesterday I was using the RPMs from the branch associated with the PR, and today I'm using the latest RPMs from release-21.1.0 (which include the merge of this fix, as you can see in the file I've uploaded; in other words: opennms-core-21.1.0-0.20180412.onms2099.release.21.1.0.12.noarch).
Thoughts?

Alejandro Galue April 5, 2018 at 12:49 PM
On a build of develop from 3 days ago (without your changes), I tried "-o ServerAliveInterval=10", and it didn't work (as expected: I had already tried that when I initially discovered the problem, since it is one of the suggested workarounds). In my case, the session didn't die, but became unresponsive or frozen after 10 minutes (no more feedback from the metrics:stress command, which I had configured to report every 30 seconds, and no response to Ctrl+C). This is what I've used:
Interestingly, even though I get no feedback and cannot revive the frozen SSH session, I can see that the RRDs are still being updated as if the command were still running (at least for more than 10 minutes after the SSH session became unresponsive). Without ServerAliveInterval, the command stops after 10 minutes because the SSH session has died.
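One way to confirm server-side that the command is still running while the client hangs is to look for recently written RRD/JRB files. A small sketch; the /opt/opennms/share/rrd path in the usage note is the default OpenNMS storage location and may differ in your install:

```shell
# Print RRD/JRB files modified within the last minute under the given
# directory; non-empty output means the collector is still writing
# even though the SSH client appears frozen.
recent_rrds() {
  find "$1" \( -name '*.rrd' -o -name '*.jrb' \) -mmin -1
}
```

Usage would be something like `recent_rrds /opt/opennms/share/rrd`, repeated every minute or so while the session is hung.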
This is what I experienced at the customer's site, and the reason why I opened this issue.
Now, I've compiled your branch and executed the metrics:stress with the ServerAliveInterval flag:
So I was able to verify that the SSH session didn't die: even though we have to specify the ServerAliveInterval flag, it does the trick, and we can use the stress tool without issues again.
As you've mentioned, without the ServerAliveInterval flag, the session dies even with the new Karaf.
That being said, updating Karaf certainly solved the problem, but we should update the official documentation to let users know that this flag is needed when you expect to keep an SSH session open for a long time (especially for commands where time matters, like the stress test).
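If the documentation gets that note, it might also suggest a per-host entry in ~/.ssh/config so the flag doesn't have to be typed every time. A sketch, assuming the default Karaf SSH port and admin user; the 10-second interval simply mirrors the value tried above:

```
Host karaf
    HostName localhost
    Port 8101
    User admin
    ServerAliveInterval 10
```

With this in place, `ssh karaf` picks up the keepalive setting automatically.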
Working with Drift 68, I was trying to use the metrics:stress tool to test the new Cassandra cluster a customer is building, and found that SSH sessions against Karaf die or freeze after 10 minutes.
I've added
sshIdleTimeout=0
to org.apache.karaf.shell.cfg, and even connected to Karaf using
ssh -o ServerAliveInterval=60 admin@localhost -p 8101
but neither helped. Digging on the Internet, I found the following:
https://issues.apache.org/jira/browse/KARAF-5473
It seems like we have to upgrade Karaf to be able to use the stress tool again, unless there is a workaround.
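For anyone retrying the workaround above, the sshIdleTimeout change can be scripted idempotently. A minimal sketch, assuming the standard Karaf config file name; the /opt/opennms/etc path in the comment is the default OpenNMS location and may differ:

```shell
# Set sshIdleTimeout=0 (disable idle disconnect) in a Karaf shell config
# file, replacing any existing value or appending the key if absent.
# Note: this rewrite moves the key to the end of the file.
set_idle_timeout() {
  cfg=$1
  grep -v '^sshIdleTimeout=' "$cfg" > "$cfg.tmp" || true
  echo 'sshIdleTimeout=0' >> "$cfg.tmp"
  mv "$cfg.tmp" "$cfg"
}
# e.g. set_idle_timeout /opt/opennms/etc/org.apache.karaf.shell.cfg
```

As noted in the comments above, though, this setting alone did not keep the session alive; the Karaf upgrade plus the client-side ServerAliveInterval flag is what worked.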