Pollerd attempts to create duplicate outages

Description

Pollerd threads can fail with the following exception, even with the fix for NMS-7519 (https://opennms.atlassian.net/browse/NMS-7519) applied:

Exception in thread "Poller-Thread-347-of-500" Exception in thread "Poller-Thread-379-of-500" org.springframework.dao.DataIntegrityViolationException: could not insert: [org.opennms.netmgt.model.OnmsOutage]; SQL [insert into outages (ifLostService, ifRegainedService, ifserviceId, svcLostEventId, svcRegainedEventId, suppressTime, suppressedBy, outageId) values (?, ?, ?, ?, ?, ?, ?, ?)]; constraint [one_outstanding_outage_per_service_idx]; nested exception is org.hibernate.exception.ConstraintViolationException: could not insert: [org.opennms.netmgt.model.OnmsOutage]
at org.springframework.orm.hibernate3.SessionFactoryUtils.convertHibernateAccessException(SessionFactoryUtils.java:643)
at org.springframework.orm.hibernate3.HibernateAccessor.convertHibernateAccessException(HibernateAccessor.java:412)
at org.springframework.orm.hibernate3.HibernateTemplate.doExecute(HibernateTemplate.java:412)
at org.springframework.orm.hibernate3.HibernateTemplate.executeWithNativeSession(HibernateTemplate.java:375)
at org.springframework.orm.hibernate3.HibernateTemplate.saveOrUpdate(HibernateTemplate.java:738)
at org.opennms.netmgt.dao.hibernate.AbstractDaoHibernate.saveOrUpdate(AbstractDaoHibernate.java:410)
at org.opennms.netmgt.poller.QueryManagerDaoImpl.openOutagePendingLostEventId(QueryManagerDaoImpl.java:116)
at org.opennms.netmgt.poller.DefaultPollContext.openOutage(DefaultPollContext.java:303)
at org.opennms.netmgt.poller.pollables.PollableService.createOutage(PollableService.java:272)

Acceptance / Success Criteria

None

Activity

Jesse White March 25, 2015 at 2:32 PM

Fixed in f2a2ac87b3f7d25affddfc8123653632376c1fb7.

Jesse White March 25, 2015 at 2:08 PM

The issue can occur when outage records are not populated with an svcLostEvent. This may happen if the system is restarted after the outage record is created, but before the event is received back from the event bus.

On restart, pollerd assumes the service is Up despite the outstanding outage record. If the service then goes offline, pollerd attempts to create another outage, which violates the one_outstanding_outage_per_service_idx constraint.
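
The failure mode described above can be reproduced in miniature. The sketch below is not OpenNMS code and is not necessarily what the fix commit does. OutageSketch, OutageStore, openOutageUnguarded and openOutageGuarded are hypothetical names, and the in-memory map only stands in for the outages table and its one-open-outage-per-service constraint. It assumes one possible guard: look up an existing open outage for the service before inserting a new one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; none of these types exist in OpenNMS.
class OutageSketch {

    /** Simplified outage row. lostEventId stays null while the event is pending. */
    static final class Outage {
        final int serviceId;
        Integer lostEventId; // null until the event comes back from the event bus

        Outage(int serviceId) {
            this.serviceId = serviceId;
        }
    }

    /** In-memory stand-in for the outages table with one open outage per service. */
    static final class OutageStore {
        private final Map<Integer, Outage> openByService = new HashMap<>();

        Optional<Outage> findOpen(int serviceId) {
            return Optional.ofNullable(openByService.get(serviceId));
        }

        Outage insert(int serviceId) {
            // Mimics the unique constraint: a second open outage is rejected.
            if (openByService.containsKey(serviceId)) {
                throw new IllegalStateException(
                        "duplicate open outage for service " + serviceId);
            }
            Outage outage = new Outage(serviceId);
            openByService.put(serviceId, outage);
            return outage;
        }
    }

    /** Unguarded path: assumes the service was Up and always inserts. */
    static Outage openOutageUnguarded(OutageStore store, int serviceId) {
        return store.insert(serviceId);
    }

    /** Assumed guard: reuse an outstanding outage if one already exists. */
    static Outage openOutageGuarded(OutageStore store, int serviceId) {
        return store.findOpen(serviceId).orElseGet(() -> store.insert(serviceId));
    }

    public static void main(String[] args) {
        OutageStore store = new OutageStore();

        // Outage created, then a "restart" happens before lostEventId is set.
        Outage beforeRestart = store.insert(42);

        try {
            openOutageUnguarded(store, 42);
        } catch (IllegalStateException e) {
            System.out.println("unguarded: " + e.getMessage());
        }

        Outage afterRestart = openOutageGuarded(store, 42);
        System.out.println("guarded reuses existing row: "
                + (afterRestart == beforeRestart));
    }
}
```

The unguarded call fails in the same way as the insert in the stack trace, while the guarded call returns the row that was left behind before the restart.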

Fixed

Details

Created March 25, 2015 at 2:05 PM
Updated May 11, 2015 at 2:49 PM
Resolved March 25, 2015 at 2:32 PM