Latest 1.9 snapshot deletes services
Activity

Tarus Balog January 31, 2012 at 1:59 PM
Okay, I was wrong - capsd wasn't disabled.
This problem turns out to be a race condition between provisiond and capsd. If provisiond is handling newSuspect events, it will use a foreign source to test for services. However, since nodes created this way won't belong to a provisioning group, capsd tries to manage them on a restart.
In this case the "Update" service detector was added to the default foreign source, but no configuration existed for the "Update" service in capsd-configuration.xml. Since capsd had no definition for the service, it deleted it from the nodes it was managing.
This problem will go away entirely once capsd is fully deprecated. In the meantime, the workaround is to either disable capsd or include all services in the capsd-configuration file.
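For the second workaround, a rough way to spot services that provisiond can detect but capsd does not know about is to compare the two files. The following is only a sketch: it assumes the foreign source definition lists services as `detector` elements with a `name` attribute and that capsd-configuration.xml lists them as `protocol-plugin` elements with a `protocol` attribute. The helper names and the trimmed sample XML are hypothetical; real files also carry class names and parameters.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}detector' down to 'detector'."""
    return tag.rsplit("}", 1)[-1]


def names_of(xml_text, element, attribute):
    """Collect the given attribute from every element with the given local name."""
    root = ET.fromstring(xml_text)
    return {
        node.get(attribute)
        for node in root.iter()
        if local_name(node.tag) == element and node.get(attribute)
    }


def services_missing_from_capsd(foreign_source_xml, capsd_xml):
    """Return detector names that have no matching capsd protocol-plugin."""
    detectors = names_of(foreign_source_xml, "detector", "name")
    plugins = names_of(capsd_xml, "protocol-plugin", "protocol")
    return sorted(detectors - plugins)


# Hypothetical, heavily trimmed inputs for illustration only.
FOREIGN_SOURCE = """
<foreign-source name="default">
  <detectors>
    <detector name="ICMP"/>
    <detector name="Update"/>
  </detectors>
</foreign-source>
"""

CAPSD_CONFIG = """
<capsd-configuration>
  <protocol-plugin protocol="ICMP" scan="on"/>
</capsd-configuration>
"""

if __name__ == "__main__":
    # Any name printed here is a candidate for capsd to delete on restart.
    print(services_missing_from_capsd(FOREIGN_SOURCE, CAPSD_CONFIG))
```

Any service this reports would need a matching entry in capsd-configuration.xml, unless capsd is disabled instead.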

Tarus Balog January 30, 2012 at 12:33 PM
Hrm - not sure what's going on. Ben backed out a suspect change and made new RPMs. Prior to the restart:
opennms=# select nodeid,serviceid from ifservices where status='D';
nodeid | serviceid
--------+-----------
(0 rows)
After the restart:
opennms=# select nodeid,serviceid from ifservices where status='D';
nodeid | serviceid
--------+-----------
36 | 29
20 | 29
21 | 29
15 | 29
40 | 29
39 | 29
16 | 29
35 | 29
16 | 29
14 | 29
9 | 29
37 | 29
13 | 29
38 | 29
10 | 29
(15 rows)

Tarus Balog January 26, 2012 at 2:51 PM (edited)
Node: barbrady.internal.opennms.com
Not a member of any provisioning requisition
provisiond is handling the newSuspect events. It seems to be fine for devices that are in a provisioning group.
Note that this is new behavior in the last week or so.

Benjamin Reed January 26, 2012 at 12:15 PM
Are the nodes provisioned, or capsd-scanned?

Description
I upgraded to the latest snapshot this morning. When I restarted, I lost about seven outages. I have configured a custom service called Update that uses the SnmpMonitor to check the status of an OID. If that OID is not zero, the service is marked as down.
For the devices with the outage, the service was deleted when OpenNMS was started. This would be bad.
From the database:
opennms=# select nodeid,serviceid from ifservices where status='D';
nodeid | serviceid
--------+-----------
36 | 29
15 | 29
16 | 29
16 | 29
9 | 29
13 | 29
10 | 29
(7 rows)
These are the nodes in question (service 29 is the Update service). If I run:
opennms=# update ifservices set status='A' where status='D';
UPDATE 7
and then restart, the services are deleted again. No logs of interest that I can see.
Again, this is on barbrady in the office. I won't mess with it in hopes that someone can take a look at it soon.