Unresponsive services do not generate nodeLostService messages
Description
Activity
Tarus Balog March 17, 2005 at 2:26 PM
Fixed in stable CVS.
Tarus Balog March 10, 2005 at 4:47 PM
Okay, the code to fix this has been committed to stable.
Here are the test cases:
With serviceUnresponsive set to false:
1) Stop the service
2) Make sure that a nodeLostService event is received
3) Start the service
4) Make sure that a nodeRegainedService event is received.
5) Configure the service to be "unresponsive" - i.e. a connection can be made but expected behavior
does not occur before the timeout (I did this by running httpd on port 25 so the SMTP poller would
make the connection but not get a HELO).
6) Make sure that a nodeLostService event is received.
7) Restore the configuration
8) Make sure that a nodeRegainedService event is received.
With serviceUnresponsive set to true:
1) Stop the service
2) Make sure that a nodeLostService event is received
3) Start the service
4) Make sure that a nodeRegainedService event is received.
5) Configure the service to be "unresponsive" - i.e. a connection can be made but expected behavior
does not occur before the timeout (I did this by running httpd on port 25 so the SMTP poller would
make the connection but not get a HELO).
6) Make sure that a serviceUnresponsive event is received, and no nodeLostService event is generated
7) Restore the configuration
8) Make sure that a serviceResponsive event is received.
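Step 5 in both lists depends on a service that accepts connections but never answers. As a rough sketch of that condition (not the SmtpMonitor code), the Python below starts a listener that stays silent and a probe that tells the three outcomes apart. The names start_listener and probe_smtp are hypothetical, and the probe assumes that a responsive SMTP server opens the session with a 220 greeting line, which is what the comment above loosely calls the HELO.

```python
import socket
import threading


def start_listener(banner=None, host="127.0.0.1"):
    """Start a TCP listener on a free port (hypothetical test helper).

    With banner=None it accepts connections and stays silent, which mimics
    the "httpd on port 25" trick. With a banner it greets each client.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0 lets the OS pick a free port
    server.listen(5)
    held = []  # keep accepted sockets open so clients see silence, not EOF

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listener was closed
            held.append(conn)
            if banner is not None:
                conn.sendall(banner)

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]


def probe_smtp(host, port, timeout=3.0):
    """Classify a service as 'up', 'unresponsive', or 'down'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "down"  # no connection could be made
    try:
        sock.settimeout(timeout)
        greeting = sock.recv(512)
    except socket.timeout:
        return "unresponsive"  # connected, but nothing arrived in time
    except OSError:
        return "down"
    finally:
        sock.close()
    # Assumption: anything other than a 220 greeting counts as down.
    return "up" if greeting.startswith(b"220") else "down"
```

Pointing probe_smtp at the silent listener returns "unresponsive", which is the state that steps 6 onward are checking.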
Tarus Balog March 10, 2005 at 2:06 PM
Ted - in OpenNMS a service is down if we either can't connect to it or it doesn't respond as expected
within the configured timeout. In the situation I describe, the server was frozen, dead, passed on,
singing in the choir invisible.
But it still responded to pings and port connections. The SMTP service should have sent a HELO, but it
didn't, so it is down.
However, there are some users who only want their availability affected if the service is totally
unreachable. In that case, they can turn on "serviceUnresponsive" in the poller configuration. This will
generate a serviceUnresponsive/serviceResponsive event pair instead of nodeLostService/
nodeRegainedService.
The missing nodeLostService event is due to some changes in the poller code. It should be corrected soon.
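To make the two behaviours concrete, here is a minimal Python sketch of which events one would expect for a series of poll results. It is an illustration only, not the poller's logic: expected_events, OUTAGE_EVENTS, and the three result strings are hypothetical names, and the service is assumed to start out up.

```python
# Event pair (outage start, outage end) for each non-up state.
OUTAGE_EVENTS = {
    "down": ("nodeLostService", "nodeRegainedService"),
    "unresponsive": ("serviceUnresponsive", "serviceResponsive"),
}


def expected_events(poll_results, unresponsive_enabled):
    """Return the event names expected for a sequence of poll results.

    poll_results: iterable of 'up', 'down', or 'unresponsive'.
    unresponsive_enabled: True if the serviceUnresponsive option is on.
    """
    events = []
    state = "up"  # assumption: the service starts out available
    for result in poll_results:
        # With the option off, an unresponsive service simply counts as down.
        if result == "unresponsive" and not unresponsive_enabled:
            result = "down"
        if result == state:
            continue  # no state change, so no event
        if state == "up":
            events.append(OUTAGE_EVENTS[result][0])
            state = result
        elif result == "up":
            events.append(OUTAGE_EVENTS[state][1])
            state = "up"
        # A direct change between down and unresponsive is not described
        # here, so this sketch leaves the state untouched in that case.
    return events


# Stop, start, make unresponsive, restore:
sequence = ["down", "up", "unresponsive", "up"]
print(expected_events(sequence, unresponsive_enabled=False))
print(expected_events(sequence, unresponsive_enabled=True))
```

With the option off, every outage produces the nodeLostService/nodeRegainedService pair. With it on, only the unreachable case does, so availability is unaffected by a service that connects but does not answer.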
Ted Kaczmarek March 8, 2005 at 7:08 PM
If the service is not down, why would you want it marked as down? I think
unresponsive is indeed an accurate reflection of the state of the service.
For critical machines you probably want to add notifications for unresponsive
services. Your log does indeed show
SmtpMonitor: connected to host: /172.20.0.177 on port: 25
so the service is up.
Details
Assignee: OpenNMS Bug Mailing List
Reporter: Tarus Balog
Priority: Major

We had a server get very busy last night (no pun intended) and the SMTP service would not send a HELO
within the timeout. The logs show the service as unresponsive:
2005-03-04 03:25:34,569 DEBUG [PollerScheduler-30 Pool-fiber1] PollableServiceConfig: Polling 172.20.0.177:SMTP using pkg example1
2005-03-04 03:25:34,569 DEBUG [PollerScheduler-30 Pool-fiber1] SmtpMonitor: poll: address = 172.20.0.177, port = 25, timeout = 3000, retry = 1
2005-03-04 03:25:34,587 DEBUG [PollerScheduler-30 Pool-fiber1] SmtpMonitor: SmtpMonitor: connected to host: /172.20.0.177 on port: 25
2005-03-04 03:25:34,697 DEBUG [OpenNMS.Poller.DefaultPollContext] EventIpcManagerDefaultImpl$ListenerThread: run: calling onEvent on OpenNMS.Poller.DefaultPollContext for event uei.opennms.org/generic/traps/EnterpriseDefault
2005-03-04 03:25:37,588 DEBUG [PollerScheduler-30 Pool-fiber1] SmtpMonitor: SmtpMonitor: did not connect to host within timeout: 3000 attempt: 0
2005-03-04 03:25:37,607 DEBUG [PollerScheduler-30 Pool-fiber1] SmtpMonitor: SmtpMonitor: connected to host: /172.20.0.177 on port: 25
2005-03-04 03:25:40,607 DEBUG [PollerScheduler-30 Pool-fiber1] SmtpMonitor: SmtpMonitor: did not connect to host within timeout: 3000 attempt: 1
2005-03-04 03:25:40,608 DEBUG [PollerScheduler-30 Pool-fiber1] PollableServiceConfig: Finish polling 9:172.20.0.177:SMTP using pkg example1 result =Unresponsive
2005-03-04 03:25:40,608 DEBUG [PollerScheduler-30 Pool-fiber1] PollableNode$Lock: Releasing lock for 9
This should have generated a nodeLostService event.
Note: The IP addresses have been changed to protect the innocent.
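As a quick sanity check on the log above, the poll ran from 03:25:34,569 to 03:25:40,608, and the two attempts logged (attempt: 0 and attempt: 1) each had timeout = 3000. The short Python sketch below compares the observed duration with that budget. The helper name elapsed_ms is hypothetical, and the reading that retry = 1 means two attempts in total is taken from the attempt numbers in the log.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the log lines above (milliseconds after the comma).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_ms(start, end):
    """Whole milliseconds between two log timestamps."""
    delta = (datetime.strptime(end, LOG_TIME_FORMAT)
             - datetime.strptime(start, LOG_TIME_FORMAT))
    return delta // timedelta(milliseconds=1)


timeout_ms, retry = 3000, 1
attempts = retry + 1                 # attempt 0 plus one retry
budget_ms = timeout_ms * attempts    # time spent waiting if every attempt times out

observed_ms = elapsed_ms("2005-03-04 03:25:34,569", "2005-03-04 03:25:40,608")
print(budget_ms, observed_ms)  # prints "6000 6039"
```

The few extra milliseconds are consistent with the connection setup seen before each wait, so the whole poll was spent waiting on a service that connected but never answered.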