Incorrect Node Availability in Reports

Description

I have a site consisting of 2 CentOS VMs and one virtual IP address on 2 ESXi machines, functioning as a failover pair. These VMs and ESXi hosts all expose ping, http, https and one or two other services. The virtual IP is an alternate address for the services being presented, and therefore should mirror the available services of the active host in the failover pair.
For 5 of the previous 7 days, one of the machines was down, waiting for parts. This should have resulted in a 40%*5/7=28% host-down situation in reports, with all services offline on the failed machine.
Instead, the 7-day node availability report shows only 1.37 hours of downtime and 99.182% availability, which is clearly wildly incorrect.

Environment

CentOS 5.8, fully updated

Acceptance / Success Criteria

None

Attachments

Linked issues

depends on

NMS-7475

Problems with "NodeAvailability" report

Activity

Seth Leger June 29, 2016 at 8:57 AM

I've looked at the pgdump that is attached to the issue and the outages in there appear to correspond to the information in the PDF report.

If there are any inaccuracies here, they were probably due to the issues fixed in . I'm going to mark this as cannot reproduce.

John Mellor June 5, 2012 at 3:07 PM

Yet another data point: yesterday and part of today, I had a host completely down (i.e. no power) for roughly 27 hours, waiting for a replacement disk. Polling is at the default 5-minute interval. Looking at the failed node, I see the correct downtime, so the database values are very close to correct. However, the 7-day availability report should show 27/168 hours (about 16%) downtime for the failed server. Instead it shows only 1.12% downtime, roughly 1/14th of the correct figure.
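For reference, the arithmetic behind that comparison can be sketched as follows. This is only a rough check using the figures quoted in this comment (27 hours down, a 168-hour window, 1.12% reported); the function name `percent_down` is made up for illustration and is not part of OpenNMS.

```python
HOURS_PER_WEEK = 7 * 24  # 168-hour window covered by the 7-day report


def percent_down(outage_hours, window_hours=HOURS_PER_WEEK):
    """Return the share of the reporting window spent down, as a percentage."""
    return 100.0 * outage_hours / window_hours


expected = percent_down(27)   # outage length quoted in the comment
reported = 1.12               # percent down shown in the 7-day report
ratio = expected / reported   # how far off the report is

print(f"expected {expected:.2f}% down, report shows {reported:.2f}% "
      f"(off by a factor of about {ratio:.1f})")
```

The expected value comes out at a little over 16%, about 14 times the reported figure, which matches the "1/14th" estimate above.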

John Mellor May 29, 2012 at 5:44 PM

pg_dump

John Mellor May 23, 2012 at 11:39 AM

Hi Donald; I think the biggest problem with the report is that I expect to see massive amounts of red in the daily bar graphs in the various categories for the 2.3-day outage. Instead, I hardly see any red at all. Is it picking a best-case set of services to report on instead of what I'm expecting? Most servers and VMs were not even pingable for the duration of the outage.

Donald Desloge May 23, 2012 at 10:50 AM

The report is separated by surveillance category. Looking at the report you attached, if I scroll down to the "Production" or "Servers" category, I see a lot of servers listed with about 30% overall downtime and a 52 hour outage.

Summary for category: Servers

Average 1.00 50.92 50.92 30.307 69.693
Maximum 1.00 51.51 51.51 30.663 71.052
Minimum 1.00 48.63 48.63 28.948 69.337

With the following servers as the top 25%

Node Percent Down
Top 25 Percent
Server Richese ESXi 30.66%
Server Vernius ESXi 30.66%
Server Ix ESXi 30.66%
Server Atreides ESXi 30.66%
Server Corrino ESXi 30.66%
Server Office ESXi 29.90%
Server Services ESXi 28.95%
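As a rough cross-check, the summary rows above appear consistent with a 168-hour (7-day) window. The sketch below assumes, without confirmation from the report itself, that the second and third columns are outage hours and the fourth is percent down; the variable names are hypothetical.

```python
WINDOW_HOURS = 7 * 24  # assumed 7-day reporting window

# (row label, outage hours, percent down) copied from the "Servers" summary.
# The column meanings are an assumption; the report excerpt has no headers.
rows = [
    ("Average", 50.92, 30.307),
    ("Maximum", 51.51, 30.663),
    ("Minimum", 48.63, 28.948),
]

# Percent down implied by the outage hours, keyed by row label.
implied = {label: 100.0 * hours / WINDOW_HOURS for label, hours, _ in rows}

for label, hours, pct in rows:
    diff = abs(implied[label] - pct)
    print(f"{label}: {hours} h implies {implied[label]:.3f}% down, "
          f"report says {pct}% (difference {diff:.3f})")
```

Under that assumption, each row's hours divided by 168 lands within about 0.01 percentage points of the listed percent-down value.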

Cannot Reproduce

Details
Assignee
Unassigned
Reporter
John Mellor
Components
Affects versions
1.10.1
Priority
Major

Created May 18, 2012 at 9:56 AM

Updated June 29, 2016 at 8:57 AM

Resolved June 29, 2016 at 8:57 AM
