Notifications received even though a Schedule Outage applies

Fixed

Description

This is a continuation of https://opennms.atlassian.net/browse/NMS-5492 and https://opennms.atlassian.net/browse/NMS-8447, as the issue remains unresolved. It was brought up by a customer and is reproducible in Meridian 2019.1.16 and Horizon 27.1.0. In my case, I had notifications sent to the on-call role OnCall1, turned on notifd debug logging, and then generated a node down event during a scheduled outage. The down notification was sent when the scheduled outage expired; the result is in notifd.log (look for notification ID 38 at the end of the file). In addition, the customer mentioned that even when the scheduled outage is removed before it expires, the down notification is still sent. I verified that this behavior applies. The logs and configurations are attached.
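For anyone reproducing this, the node down event can be generated the same way as the service events shown in the comments below; a minimal sketch, assuming node ID 1 is covered by the scheduled outage and logs land in $OPENNMS_HOME/logs:

# generate a node down event while the scheduled outage is active
send-event.pl -n 1 uei.opennms.org/nodes/nodeDown

# watch how notifd handles the resulting notice
tail -f $OPENNMS_HOME/logs/notifd.log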

Environment

  • CentOS Linux 7 (Core), 3.10.0-1160.6.1.el7.x86_64, openjdk version "11.0.11"
  • CentOS Linux 8, 4.18.0-240.15.1.el8_3.x86_64, openjdk version "11.0.9.1"

Acceptance / Success Criteria

None

Attachments

1 attachment
  • 28 Apr 2021, 06:46 PM

Details

Created April 28, 2021 at 6:51 PM
Updated May 19, 2021 at 5:04 PM
Resolved May 19, 2021 at 1:44 PM

Activity

JianYet May 19, 2021 at 5:04 PM

Verified that in both cases above, the notification was not sent. The functionality is working as expected. 

JianYet May 19, 2021 at 2:44 PM
Edited

The PR has stopped DOWN notifications created during a scheduled outage from being sent, to a certain extent. However, when a duty schedule is in place for a group, the notification is still queued even though I've tried to clear the queue by auto-acknowledging it. Here are my configuration and steps to reproduce.

Group config

<group>
    <name>Users</name>
    <user>user1</user>
    <user>user2</user>
    <duty-schedule>Th900-1700</duty-schedule>
</group>
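(If I read the duty-schedule format right, Th900-1700 puts the group on duty only on Thursdays between 09:00 and 17:00, which matches the nextInSchedule lines in the log below deferring the notice to the next day.)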

Schedule Outage config

<outage name="test outage" type="specific">
    <time begins="19-May-2021 00:00:00" ends="19-May-2021 09:59:59"/>
    <node id="1"/>
</outage>
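For completeness, notifd only honors outages that are associated with notifications; assuming the stock configuration, that means the outage name above also appears as an outage-calendar entry in notifd-configuration.xml:

<!-- assumption: notifd suppresses notices only for outages listed as outage calendars -->
<outage-calendar>test outage</outage-calendar>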

Steps to reproduce

send-event.pl -n 1 uei.opennms.org/nodes/nodeLostService -s SNMP -i 192.x.y.z
send-event.pl -n 1 uei.opennms.org/nodes/nodeRegainedService -s SNMP -i 192.x.y.z

Logs in debug mode show that acknowledging it didn't clear the queue:

2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: num message : 111-6
2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: subject : Notice #6: SNMP down on 192.x.y.z (192.x.y.z) on node onms01.
2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: node : 1
2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: interface : 192.x.y.z
2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: service : SNMP
2021-05-19 09:50:36,111 INFO [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Inserting notification #6 into database: Notice #6: SNMP down on 192.x.y.z (192.x.y.z) on node onms01.
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.BasicScheduleUtils: isTimeInOutage: checking for time '2021-05-19T09:50:36.114-0400' in schedule 'test outage'
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.BasicScheduleUtils: isTimeInOutage: checking begin/end time...
current: 2021-05-19T09:50:36.114-0400
begin: 2021-05-19T00:00:00.000-0400
end: 2021-05-19T09:40:59.000-0400
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Processing target Users:0m
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: day of week is 2
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: Remainder of today is not in schedule
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: day 3 is in schedule
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: day 3 is 1 from today
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: duty begins in 83363887 millisecs
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.GroupManager: isGroupOnDuty: On duty in 0 millisec from schedule 83363887
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: The group Users is on duty in 83363887 millisec.
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.NoticeQueue: task queued for notifyID 6
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.NoticeQueue: task queued for notifyID 6
2021-05-19 09:50:51,249 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}
2021-05-19 09:51:11,250 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}
2021-05-19 09:51:31,255 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}
2021-05-19 09:51:40,471 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.e.EventIpcManagerDefaultImpl: run: calling onEvent on Notifd:BroadcastEventProcessor for event {"uei":"uei.opennms.org/nodes/nodeRegainedService","time":"2021-05-19 09:51:40,000-0400","dbid":22,"source":"perl_send_event","nodeid":1,"parms":[]}
2021-05-19 09:51:40,471 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: No notice match for uei: uei.opennms.org/nodes/nodeRegainedService
2021-05-19 09:51:40,471 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Acknowledging event uei.opennms.org/nodes/nodeLostService 1:192.x.y.z:SNMP
2021-05-19 09:51:40,474 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.NotificationManager: EventID for notice(s) to be acked: 21
2021-05-19 09:51:40,476 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.NotificationManager: Matching DOWN notifyID = 6, was acked by user = false, ansBy = auto-acknowledged
2021-05-19 09:51:51,257 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}

On the other hand, when the group was currently on duty and the notification was auto-acknowledged, it didn't stay in the queue.

2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Processing target Users:0m
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: day of week is 2
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: Today is in schedule
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: on duty now
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.GroupManager: isGroupOnDuty: On duty in 0 millisec from schedule 0
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: The group Users is on duty in 0 millisec.
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.NoticeQueue: task queued for notifyID 8
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.NoticeQueue: task queued for notifyID 8
2021-05-19 10:51:25,345 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.e.EventIpcManagerDefaultImpl: run: calling onEvent on Notifd:BroadcastEventProcessor for event {"uei":"uei.opennms.org/nodes/nodeRegainedService","time":"2021-05-19 10:51:25,000-0400","dbid":26,"source":"perl_send_event","nodeid":1,"parms":[]}
2021-05-19 10:51:25,345 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: No notice match for uei: uei.opennms.org/nodes/nodeRegainedService
2021-05-19 10:51:25,345 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Acknowledging event uei.opennms.org/nodes/nodeLostService 1:192.168.211.100:SNMP
2021-05-19 10:51:25,346 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.NotificationManager: EventID for notice(s) to be acked: 25
2021-05-19 10:51:25,346 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.NotificationManager: Matching DOWN notifyID = 8, was acked by user = false, ansBy = auto-acknowledged
2021-05-19 10:51:32,284 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}

In addition, I wasn't able to test the On-Call role at the moment due to https://opennms.atlassian.net/browse/NMS-13287.

Chandra Gorantla May 14, 2021 at 3:57 PM

Can you check if the above PR fixes the issues that you are seeing? You can get the artifacts from CircleCI.

Chandra Gorantla May 13, 2021 at 8:07 PM

JianYet May 12, 2021 at 1:33 PM

Reopening to further discuss the appropriate default behavior of notifd when a scheduled outage applies. As far as users are concerned (see also https://opennms.atlassian.net/browse/NMS-5492 and https://opennms.atlassian.net/browse/NMS-8447), a scheduled outage means no notifications at all, regardless of whether the DOWN events were acknowledged. Sending DOWN notifications when the scheduled outage expires, say 14 hours later, would be confusing and undesired. It would likewise be counter-intuitive to acknowledge events manually, or auto-acknowledge them, just for the sake of a scheduled outage. Moreover, removing the notifications/events from notifd-configuration.xml as a workaround means opting out of the auto-acknowledge feature, so it is a band-aid rather than a permanent fix.
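For reference, that workaround amounts to dropping entries like the following from notifd-configuration.xml; a minimal sketch based on the stock auto-acknowledge configuration (exact attributes may vary by version):

<!-- resolves an outstanding nodeLostService notice when the matching nodeRegainedService event arrives -->
<auto-acknowledge uei="uei.opennms.org/nodes/nodeRegainedService" acknowledge="uei.opennms.org/nodes/nodeLostService">
    <match>nodeid</match>
    <match>interfaceid</match>
    <match>serviceid</match>
</auto-acknowledge>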