The behavior of the Ticketing API differs from older versions.

Description

I discovered that the Ticketing Integration behaves differently since Drools was introduced to handle the alarm life-cycle. That major change was supposed to preserve the behavior, and the rules are meant to be equivalent to what Vacuumd used to handle. Unfortunately, something in that change produces unexpected results.

I decided to test all Meridian versions, starting with M2016 until M2020. In this scenario, M2016, M2017, and M2018 use the old behavior for alarm life-cycle and interaction with the Ticketing API implementations, whereas M2019 and M2020 use the new one that relies upon Drools.

I found that the behavior for the four use cases I tested (described below) works as intended in M2016, M2017, and M2018 (despite minor issues I found, which I'll describe later). Unfortunately, replicating the same scenarios in M2019 and M2020 didn't produce the same results.

As we need an actual Ticketing System to test the solution, I decided to use the JIRA implementation because I know it has been battle-tested at multiple customers. It is also the only one I can use at the moment. In theory, the actual implementation is irrelevant, as the problem seems to be related to the alarm life-cycle management.

If someone identifies a problem with the test scenarios or something that I forgot to configure in M2019 or M2020, please let me know. Let me know if further tests are required, as I kept all the VMs I used available.

Here are the four use cases I tried on each version:

Use Case 1 (create/close): Automations disabled in Vacuumd or Alarmd/Drools (default). The integration requires manual work

Send nodeLostService event.
Manually create Ticket from alarm page, verify JIRA, and check if the Issue ID was stored in the database.
Send nodeRegainedService event and wait until the automation marks the trigger alarm as cleared.
Close Ticket from alarm page and verify JIRA; also verify that the state was updated in the database.
Wait 5 minutes and verify that the alarm is removed from the database.

Use Case 2 (update): Automations disabled in Vacuumd or Alarmd/Drools. The integration requires manual work

Send nodeLostService event.
Manually create Ticket from alarm page, verify JIRA, and check if the Issue ID was stored in the database.
From JIRA, close the ticket.
From the alarm page, request an update of the ticket.
Verify that the ticket state in the database is marked as closed.
Send nodeRegainedService event and wait until the automation marks the trigger alarm as cleared. As the state of the ticket is closed, the alarm should be removed automatically within 5 minutes.

Use Case 3 (create/close): Automations enabled in Vacuumd or Alarmd/Drools. The integration doesn't require manual work

Send nodeLostService event.
Wait 15 minutes and check the alarm details page to see if the Ticket was created, and check if the Issue ID was stored in the database. Also, the alarm should be marked as acknowledged automatically.
Send nodeRegainedService event and wait until the automation marks the trigger alarm as cleared.
Wait 15 minutes and check if the Ticket is closed automatically; verify that the state was updated in the database and the JIRA issue was updated.
Unacknowledge the alarm to speed up the removal process.
Wait until the alarm is removed from the database.

Use Case 4 (update): Automations enabled in Vacuumd or Alarmd/Drools. The integration doesn't require manual work

Send nodeLostService event.
Wait 15 minutes and check the alarm details page to see if the Ticket was created, and check if the Issue ID was stored in the database. Also, the alarm should be marked as acknowledged automatically.
From JIRA, close the ticket.
Wait until the alarm is updated automatically.
Verify that the ticket state in the database is marked as closed.
Send nodeRegainedService event and wait until the automation marks the trigger alarm as cleared.
Unacknowledge the alarm to speed up the removal process.
Wait until the alarm is removed from the database.
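
To make the expected outcome of the four use cases explicit, here is a minimal sketch of the removal condition the steps above assume. It is not the actual Vacuumd or Drools rule: the class, the state names, and the function are hypothetical, and the 5-minute window is taken from the use case steps. Acknowledged alarms are assumed to follow a slower cleanup path that is not modeled here, which is why the automated use cases unacknowledge the alarm first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: window taken from "Wait 5 minutes" in the use cases.
REMOVAL_DELAY = timedelta(minutes=5)
# Assumption: an alarm without a ticket, or with a closed ticket, may be removed.
REMOVABLE_TICKET_STATES = {None, "CLOSED"}


@dataclass
class Alarm:
    """Hypothetical, simplified view of a row in the alarms table."""
    severity: str                 # e.g. "MINOR", "CLEARED"
    ticket_state: Optional[str]   # e.g. None, "OPEN", "CLOSED"
    acknowledged: bool
    last_event_time: datetime


def should_be_removed(alarm: Alarm, now: datetime) -> bool:
    """Return True if the use cases expect the alarm to be gone by `now`."""
    if alarm.severity != "CLEARED":
        return False  # only cleared alarms are candidates
    if alarm.acknowledged:
        return False  # acknowledged alarms are not covered by this sketch
    if alarm.ticket_state not in REMOVABLE_TICKET_STATES:
        return False  # an open ticket keeps the alarm around
    return now - alarm.last_event_time >= REMOVAL_DELAY
```

Under this reading, the cleared alarms with a CLOSED ticket state described in the results for M2019 and M2020 satisfy the condition, yet remain in the database.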

About the environment

I added the following properties to enable the JIRA Integration:

Note that "alarmTroubleTicketLinkTemplate" is treated differently depending on the version, but as the link is purely informational, the difference does not affect the tests described here.

In terms of the JIRA Plugin, I have the following in jira.properties:

I hid the credentials to protect the JIRA instance, as it is public.

Also, I used the send-event.pl script to generate the nodeLostService and nodeRegainedService required for the test cases.
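
As an illustration of how the two events can be generated, here is a small sketch that builds the send-event.pl command line. The script path is the usual default install location, and the node ID, IP address, and service name in the usage note are placeholders, not the values used in these tests. The helper function names are hypothetical.

```python
import subprocess
from typing import List

# Assumption: default OpenNMS install location; adjust to your environment.
SEND_EVENT = "/opt/opennms/bin/send-event.pl"
UEI_PREFIX = "uei.opennms.org/nodes/"


def build_send_event(event: str, node_id: int, ip: str, service: str) -> List[str]:
    """Build the argument list for send-event.pl (options first, then the UEI)."""
    return [
        SEND_EVENT,
        "-n", str(node_id),   # node ID
        "-i", ip,             # interface (IP address)
        "-s", service,        # service name
        UEI_PREFIX + event,   # full event UEI
    ]


def send(event: str, node_id: int, ip: str, service: str) -> None:
    """Run send-event.pl on the OpenNMS server; raises if the script fails."""
    subprocess.run(build_send_event(event, node_id, ip, service), check=True)
```

For example, `send("nodeLostService", 1, "192.0.2.10", "ICMP")` raises the problem alarm, and the same call with `"nodeRegainedService"` sends the resolving event.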

Finally, I reduced the time from 15 minutes to 1 minute on the Vacuumd/Drools rules associated with automatic ticket creation, closing, and update to speed up the process. However, the use cases mention the default value of 15 minutes.

The workflow I used for each Meridian version was the following:

Start a VM with a clean installation of the Meridian version to test (which contains the RPM for the JIRA plugin). Keep in mind that the Java and PostgreSQL versions differ in some cases.
Configure the solution to enable the JIRA Trouble Ticket Integration.
Start OpenNMS.
Install the Karaf feature (jira-troubleticket).
Perform Use Case 1.
Verify the results.
Wait until the alarms table is empty.
Perform Use Case 2 (use a different service if the alarms table is not empty or you don't want to wait).
Verify the results.
Wait until the alarms table is empty. If that doesn't happen, stop OpenNMS, truncate the table and then start OpenNMS.
Reduce the times as mentioned and send the reloadDaemonConfig event to enable the automatic behavior.
Perform Use Case 3.
Verify the results.
Wait until the alarms table is empty.
Perform Use Case 4 (use a different service if the alarms table is not empty or you don't want to wait).
Verify the results.
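
The "wait until the alarms table is empty" steps can be scripted with a simple polling loop. This is only a sketch: `count_alarms` is a hypothetical callable that you would implement yourself (for instance, by running `SELECT count(*) FROM alarms` against the OpenNMS database), and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable


def wait_until_empty(
    count_alarms: Callable[[], int],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until count_alarms() returns 0; return False if the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if count_alarms() == 0:
            return True   # table is empty, safe to start the next use case
        if clock() >= deadline:
            return False  # alarms are still present, e.g. the M2019/M2020 case
        sleep(interval_s)
```

A `False` result corresponds to the fallback in the workflow: stop OpenNMS, truncate the table, and start OpenNMS again.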

Results

I will mention only what I found in M2019 and M2020, as everything worked as expected with M2016, M2017, and M2018. Later I'll share what I had to change in the older versions to get the solution working (not relevant to this discussion, but worth mentioning).

Use Case 1

In M2019 and M2020, I can create the ticket after hitting the "Create Ticket" button from the alarm page. However, the behavior when I want to close the ticket is different.

In M2019, if I click on "Close Ticket", the ticket is closed (verified on JIRA), but the cleared alarm, whose ticket has a CLOSED state in the database, is never removed from the database, even though it should be.

In M2020, if I click on "Close Ticket", the ticket state is never updated: the ticket remains open in JIRA, and the ticket state of the alarm in the database is unchanged.

Use Case 2

In both M2019 and M2020, I can create the ticket as mentioned in Use Case 1. When I click on "Update Ticket" after closing the JIRA ticket, the ticket state of the alarm is updated. However, as mentioned for Use Case 1, the cleared alarm with the CLOSED ticket state is never removed, which didn't happen with the older versions.

I'm not sure if it is due to a race condition, but if I alter the order of the steps and send the nodeRegainedService event before closing the ticket on JIRA, the cleared alarm is sometimes removed. When that happens, however, it takes longer than it would for an alarm without an associated ticket.

Use Case 3

In both M2019 and M2020, the ticket is created automatically. When the resolving event comes in, the alarm is cleared, but the ticket is not closed automatically in JIRA; therefore, the ticket state remains open and the cleared alarm is never removed.

Use Case 4

Interestingly, this is the only use case that works on all versions, including M2019 and M2020, which is surprising given the results of Use Case 3 on these two versions.

Configuration fixes on M2016 and M2017

The solution only worked after I created a Trust Store for the JVM containing the JIRA server's certificates, which was not required for M2018, M2019, and M2020.

Configuration fixes on M2018

The SQL associated with selectClosedTicketStateForProblemAlarms was wrong in vacuumd-configuration.xml. It currently has:

But there are two errors in it, and the correct SQL should be:

I understand that we don't apply configuration changes in Meridian, but in my opinion, having a broken configuration should be an exception to that rule.

Acceptance / Success Criteria

None

Activity

Chandra Gorantla March 23, 2021 at 9:30 PM

PR: https://github.com/OpenNMS/opennms/pull/3351

Alejandro Galue March 16, 2021 at 2:08 PM

I added M2019 and M2020 to the Fix Version, as I provided proof that the Ticketing Integration is broken on those versions. In fact, Meridian is what I used for the analysis.