All work
- NMS-13266: Notifications received despite Schedule Outage applies (resolved; Chandra Gorantla)
- NMS-13185: Kafka producer uses resource name instead of ifIndex as the instance for InterfaceLevelResource (resolved; Chandra Gorantla)
- NMS-13170: Create a table to show related events in the alarm detail view (resolved; Benjamin Reed)
- NMS-13167: Kafka Producer drops samples when the sending operation times out (resolved; Chandra Gorantla)
- NMS-13121: Document the Event Translator (resolved; Bonnie Robinson)
Notifications received despite Schedule Outage applies
Details
Assignee: Chandra Gorantla
Reporter: JianYet
HB Backlog Status: Backlog NG
Sprint: None
Priority: Medium
Activity
JianYet, May 19, 2021 at 5:04 PM
Verified that in both cases above, the notification was not sent. The functionality is working as expected.
JianYet, May 19, 2021 at 2:44 PM (edited)
@Chandra Gorantla The PR stops DOWN notifications created during a scheduled outage from being sent, but only to a certain extent. When a duty schedule is in place for a group, the notification is still queued even though I tried to clear the queue by auto-acknowledging it. Here are my configuration and steps to reproduce.
Group config
<group>
    <name>Users</name>
    <user>user1</user>
    <user>user2</user>
    <duty-schedule>Th900-1700</duty-schedule>
</group>
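For reference, a duty-schedule token consists of day-of-week abbreviations followed by begin and end times in HHmm form, so Th900-1700 means "on duty Thursdays from 09:00 to 17:00". Below is a minimal, hypothetical sketch of how such a token can be evaluated; the real logic lives in OpenNMS's DutySchedule class (o.o.n.c.u.DutySchedule in the logs), and the class and method names here are purely illustrative:

import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Illustrative re-implementation only, not the OpenNMS API.
public class DutyScheduleSketch {
    public static void main(String[] args) {
        // May 19, 2021 was a Wednesday, so a Thursday-only schedule is off duty.
        LocalDateTime wedMorning = LocalDateTime.of(2021, 5, 19, 9, 50);
        System.out.println(isOnDuty("Th900-1700", wedMorning)); // false
    }

    static boolean isOnDuty(String token, LocalDateTime now) {
        // Split "Th900-1700" into the day prefix and the begin/end times.
        int i = 0;
        while (i < token.length() && Character.isLetter(token.charAt(i))) i++;
        String days = token.substring(0, i);            // "Th"
        String[] range = token.substring(i).split("-"); // ["900", "1700"]
        LocalTime begin = parseHHmm(range[0]);
        LocalTime end = parseHHmm(range[1]);
        return days.contains(abbrev(now.getDayOfWeek()))
                && !now.toLocalTime().isBefore(begin)
                && now.toLocalTime().isBefore(end);
    }

    static LocalTime parseHHmm(String s) {
        int v = Integer.parseInt(s);
        return LocalTime.of(v / 100, v % 100);
    }

    static String abbrev(DayOfWeek d) {
        switch (d) {
            case MONDAY:    return "Mo";
            case TUESDAY:   return "Tu";
            case WEDNESDAY: return "We";
            case THURSDAY:  return "Th";
            case FRIDAY:    return "Fr";
            case SATURDAY:  return "Sa";
            default:        return "Su";
        }
    }
}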
Schedule Outage config
<outage name="test outage" type="specific">
    <time begins="19-May-2021 00:00:00" ends="19-May-2021 09:59:59"/>
    <node id="1"/>
</outage>
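For a type="specific" outage like this one, the isTimeInOutage check that shows up in the logs below amounts to a begin/end timestamp comparison. A minimal sketch of that comparison, assuming the semantics visible in the BasicScheduleUtils debug output (the class below is hypothetical, not OpenNMS code):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class OutageCheckSketch {
    public static void main(String[] args) throws ParseException {
        SimpleDateFormat fmt =
                new SimpleDateFormat("dd-MMM-yyyy HH:mm:ss", Locale.ENGLISH);
        Date begin = fmt.parse("19-May-2021 00:00:00");
        Date end   = fmt.parse("19-May-2021 09:59:59");
        Date now   = fmt.parse("19-May-2021 09:50:36");
        // The outage is in effect when "now" falls inside [begin, end].
        boolean inOutage = !now.before(begin) && !now.after(end);
        System.out.println("in outage: " + inOutage); // true
    }
}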
Steps to reproduce
send-event.pl -n 1 uei.opennms.org/nodes/nodeLostService -s SNMP -i 192.x.y.z
send-event.pl -n 1 uei.opennms.org/nodes/nodeRegainedService -s SNMP -i 192.x.y.z
Logs in debug mode show that acknowledging the notification didn't clear the queue:
2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: num message : 111-6
2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: subject : Notice #6: SNMP down on 192.x.y.z (192.x.y.z) on node onms01.
2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: node : 1
2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: interface : 192.x.y.z
2021-05-19 09:50:36,111 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: service : SNMP
2021-05-19 09:50:36,111 INFO [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Inserting notification #6 into database: Notice #6: SNMP down on 192.x.y.z (192.x.y.z) on node onms01.
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.BasicScheduleUtils: isTimeInOutage: checking for time '2021-05-19T09:50:36.114-0400' in schedule 'test outage'
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.BasicScheduleUtils: isTimeInOutage: checking begin/end time...
current: 2021-05-19T09:50:36.114-0400
begin: 2021-05-19T00:00:00.000-0400
end: 2021-05-19T09:40:59.000-0400
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Processing target Users:0m
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: day of week is 2
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: Remainder of today is not in schedule
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: day 3 is in schedule
2021-05-19 09:50:36,114 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: day 3 is 1 from today
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: duty begins in 83363887 millisecs
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.GroupManager: isGroupOnDuty: On duty in 0 millisec from schedule 83363887
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: The group Users is on duty in 83363887 millisec.
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.NoticeQueue: task queued for notifyID 6
2021-05-19 09:50:36,115 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.NoticeQueue: task queued for notifyID 6
2021-05-19 09:50:51,249 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}
2021-05-19 09:51:11,250 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}
2021-05-19 09:51:31,255 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}
2021-05-19 09:51:40,471 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.e.EventIpcManagerDefaultImpl: run: calling onEvent on Notifd:BroadcastEventProcessor for event {"uei":"uei.opennms.org/nodes/nodeRegainedService","time":"2021-05-19 09:51:40,000-0400","dbid":22,"source":"perl_send_event","nodeid":1,"parms":[]}
2021-05-19 09:51:40,471 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: No notice match for uei: uei.opennms.org/nodes/nodeRegainedService
2021-05-19 09:51:40,471 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Acknowledging event uei.opennms.org/nodes/nodeLostService 1:192.x.y.z:SNMP
2021-05-19 09:51:40,474 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.NotificationManager: EventID for notice(s) to be acked: 21
2021-05-19 09:51:40,476 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.NotificationManager: Matching DOWN notifyID = 6, was acked by user = false, ansBy = auto-acknowledged
2021-05-19 09:51:51,257 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}
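Note that the tree key in the queue dump above, 1621515600000, appears to be the epoch-millisecond timestamp of the group's next on-duty window, which is why the notice stays parked until Thursday morning even after the auto-acknowledge. A quick check (hypothetical class name):

import java.time.Instant;
import java.time.ZoneId;

public class QueueKeySketch {
    public static void main(String[] args) {
        // The tree key is the scheduled send time in epoch milliseconds.
        System.out.println(Instant.ofEpochMilli(1621515600000L)
                .atZone(ZoneId.of("America/New_York")));
        // => 2021-05-20T09:00-04:00[America/New_York], i.e. Thu May 20 09:00 EDT,
        //    matching the "Send javaEmail/#6 ..." entries in the log.
    }
}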
On the other hand, when the group was on duty at the time and the notification was auto-acknowledged, it didn't stay in the queue.
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Processing target Users:0m
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: day of week is 2
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: Today is in schedule
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.u.DutySchedule: nextInSchedule: on duty now
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.GroupManager: isGroupOnDuty: On duty in 0 millisec from schedule 0
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: The group Users is on duty in 0 millisec.
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.NoticeQueue: task queued for notifyID 8
2021-05-19 10:51:16,869 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.NoticeQueue: task queued for notifyID 8
2021-05-19 10:51:25,345 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.e.EventIpcManagerDefaultImpl: run: calling onEvent on Notifd:BroadcastEventProcessor for event {"uei":"uei.opennms.org/nodes/nodeRegainedService","time":"2021-05-19 10:51:25,000-0400","dbid":26,"source":"perl_send_event","nodeid":1,"parms":[]}
2021-05-19 10:51:25,345 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: No notice match for uei: uei.opennms.org/nodes/nodeRegainedService
2021-05-19 10:51:25,345 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.n.BroadcastEventProcessor: Acknowledging event uei.opennms.org/nodes/nodeLostService 1:192.168.211.100:SNMP
2021-05-19 10:51:25,346 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.NotificationManager: EventID for notice(s) to be acked: 25
2021-05-19 10:51:25,346 DEBUG [Notifd:BroadcastEventProcessor-Thread] o.o.n.c.NotificationManager: Matching DOWN notifyID = 8, was acked by user = false, ansBy = auto-acknowledged
2021-05-19 10:51:32,284 DEBUG [DefaultQueueHandler-default] o.o.n.n.DefaultQueueHandler: current state of tree: {1621515600000=[Send javaEmail/#6 to user1 at Thu May 20 09:00:00 EDT 2021, Send javaEmail/#6 to user2 at Thu May 20 09:00:00 EDT 2021]}
In addition, I wasn't able to test the On-Call role due to https://opennms.atlassian.net/browse/NMS-13287.
Chandra Gorantla, May 14, 2021 at 3:57 PM
@JianYet Can you check whether the above PR fixes the issues you are seeing? You can get the artifacts from CircleCI.
JianYet, May 12, 2021 at 1:33 PM
Reopening to further discuss the appropriate default behavior of notifd when a scheduled outage applies. As far as users are concerned (see also https://opennms.atlassian.net/browse/NMS-5492 and https://opennms.atlassian.net/browse/NMS-8447), a scheduled outage means no notifications at all, regardless of whether the DOWN events were acknowledged. Sending DOWN notifications when the scheduled outage expires, say 14 hours later, would be confusing and undesirable. Likewise, it would be counter-intuitive to acknowledge events manually, or rely on auto-acknowledge, just for the sake of a scheduled outage. Moreover, removing the notifications/events from notifd-configuration.xml as a workaround means opting out of the auto-acknowledge feature, so it is a band-aid rather than a permanent fix.
This issue is a continuation of https://opennms.atlassian.net/browse/NMS-5492 and https://opennms.atlassian.net/browse/NMS-8447, as the problem remains unresolved. It was brought up by a customer and is reproducible in Meridian 2019.1.16 and Horizon 27.1.0. In my case, I had notifications sent to the on-call role OnCall1, turned on notifd debug logging, and then generated a node down event during a scheduled outage. The down notification was sent when the scheduled outage expired; the results are in notifd.log (look for notification id 38 at the end of the file). In addition, the customer mentioned that even when the scheduled outage is removed before it expires, the down notification is still sent. I verified that this behavior applies. Attached the logs and configurations.