Incorrect Node Availability in Reports
Description
Activity

Seth Leger June 29, 2016 at 8:57 AM
I've looked at the pgdump that is attached to the issue and the outages in there appear to correspond to the information in the PDF report.
If there are any inaccuracies here, they were probably due to the issues fixed in . I'm going to mark this as cannot reproduce.

John Mellor June 5, 2012 at 3:07 PM
Yet another data point: yesterday and part of today, I had a host completely down (i.e. no power) for roughly 27 hours, waiting for a replacement disk. Polling is at the default 5-minute interval. Looking at the failed node, I see the correct downtime, so the database values are very close to correct. However, 27 hours of downtime should show up in the 7-day availability report as 27/168 hours (about 16%) downtime. Instead, the report shows only 1.12% downtime for the failed server, or roughly 1/14th of the correct figure.
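The arithmetic behind that comparison can be sketched as follows. This is only an illustration of the expected-versus-reported figures quoted above; the helper name `downtime_percent` is hypothetical and is not part of the reporting code.

```python
# Illustrative check of the figures in the comment above.
# downtime_percent is a hypothetical helper, not part of any report code.
def downtime_percent(outage_hours: float, window_hours: float) -> float:
    """Return the outage as a percentage of the reporting window."""
    return 100.0 * outage_hours / window_hours

WINDOW_HOURS = 7 * 24   # 7-day report window, 168 hours
OUTAGE_HOURS = 27       # observed outage, from the comment
REPORTED_PCT = 1.12     # percent down shown in the report, from the comment

expected_pct = downtime_percent(OUTAGE_HOURS, WINDOW_HOURS)
ratio = expected_pct / REPORTED_PCT

print(f"expected {expected_pct:.2f}% down, reported {REPORTED_PCT:.2f}%, "
      f"off by a factor of {ratio:.1f}")
```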

John Mellor May 29, 2012 at 5:44 PM
pg_dump

John Mellor May 23, 2012 at 11:39 AM
Hi Donald; I think the biggest problem with the report is that I expect to see massive amounts of red in the daily bar graphs in the various categories for the 2.3-day outage. Instead, I hardly see any red at all. Is it picking a best-case set of services to report on instead of what I'm expecting? Most servers and VMs were not even pingable for the duration of the outage.

Donald Desloge May 23, 2012 at 10:50 AM
The report is separated by surveillance category. Looking at the report you attached, if I scroll down to the "Production" or "Servers" category, I see a lot of servers listed with roughly 30% overall downtime and a 52-hour outage.
Summary for category: Servers
Average 1.00 50.92 50.92 30.307 69.693
Maximum 1.00 51.51 51.51 30.663 71.052
Minimum 1.00 48.63 48.63 28.948 69.337
With the following servers as the top 25%:
Node                     Percent Down
Server Richese ESXi      30.66%
Server Vernius ESXi      30.66%
Server Ix ESXi           30.66%
Server Atreides ESXi     30.66%
Server Corrino ESXi      30.66%
Server Office ESXi       29.90%
Server Services ESXi     28.95%
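As a rough cross-check of the "Servers" summary rows, the sketch below assumes that the repeated middle value in each row is outage duration in hours and that the report covers a 7-day (168-hour) window. Neither assumption is stated in the report excerpt; the column meanings are inferred only from the numbers.

```python
# Cross-check of the category summary rows quoted above.
# Assumptions (not confirmed by the report): the middle value is outage
# hours, and the reporting window is 7 days.
WINDOW_HOURS = 7 * 24

# label -> (assumed outage hours, percent down as printed in the summary)
rows = {
    "Average": (50.92, 30.307),
    "Maximum": (51.51, 30.663),
    "Minimum": (48.63, 28.948),
}

def percent_down(hours: float) -> float:
    """Outage hours expressed as a percentage of the assumed window."""
    return 100.0 * hours / WINDOW_HOURS

# Absolute difference between computed and printed percent-down values.
diffs = {}
for label, (hours, reported_pct) in rows.items():
    computed = percent_down(hours)
    diffs[label] = abs(computed - reported_pct)
    print(f"{label}: computed {computed:.3f}% vs reported {reported_pct:.3f}%")
```

Under these assumptions the computed and printed percentages agree to within about 0.01 percentage points, which suggests the summary rows for this category are at least internally consistent.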
Details
Assignee: Unassigned
Reporter: John Mellor
Priority: Major

I have a site consisting of 2 CentOS VMs and one virtual IP address on 2 ESXi machines, functioning as a failover pair. These VMs and ESXi hosts all expose ping, http, https and one or two other services. The virtual IP is an alternate address for the services being presented, and therefore should mirror the available services of the active host in the failover pair.
For 5 of the previous 7 days, one of the machines was down, waiting for parts. This should have resulted in a 40% * 5/7 ≈ 28.6% host-down figure in reports, with all services offline on the failed machine.
Instead, I see a 7-day node availability report showing only 1.37 hours of downtime and 99.182% up, which is clearly wildly incorrect.
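A minimal sketch of the two figures being compared in the description. The 40% factor is taken as given from the description (its derivation is not spelled out there), and the variable names are illustrative only.

```python
# Expected vs. reported figures from the description above.
# The 40% factor is the reporter's own number; its origin is not stated.
WINDOW_HOURS = 7 * 24            # 7-day report window

affected_fraction = 0.40         # as given in the description
days_down, days_total = 5, 7

expected_down_pct = 100.0 * affected_fraction * days_down / days_total

reported_down_hours = 1.37       # downtime shown in the report
reported_down_pct = 100.0 * reported_down_hours / WINDOW_HOURS
reported_up_pct = 100.0 - reported_down_pct

print(f"expected down:  {expected_down_pct:.2f}%")
print(f"reported down:  {reported_down_pct:.2f}% (up {reported_up_pct:.3f}%)")
```

The reported 1.37 hours works out to roughly the 99.18% availability shown, so the report's hours and percentage agree with each other; the discrepancy is between the report and the outage that actually occurred.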