The PSM is reporting bad values for response times
Description
Environment
Acceptance / Success Criteria
Attachments
Lucidchart Diagrams
Activity
Alexander Hoogerhuis April 2, 2013 at 6:22 PM
This:
http://masterjavacode8.appspot.com/question/5080f28e4f1eba38a4cc6e2a
and this:
http://stackoverflow.com/questions/510462/is-system-nanotime-completely-useless
and tangentially you might be up this creek:
http://www.java.net/node/694103
Might shed some light on things. So seeing how they select clocks, and you say you are on a VM, then have a look at the output of dmesg to see what clock source is available in that VM. I know some early versions of 2.6.x did not play well on ESX 3.5 and 4.0 wrt to time keeping. Also, if you are on VMWare, don't enable the time sync between host and guest and rely on ntp instead.
Alejandro Galue April 2, 2013 at 3:27 PM
Alex, thanks!
I'll do that to see if that is the source of the problem.
Alexander Hoogerhuis April 2, 2013 at 3:13 PM
First thing that strikes me: Sun Java 1.6.0_17
Try bumping that to Sun Java 1.6.0_43?
Alejandro Galue April 1, 2013 at 3:19 PMEdited
Take a look at the following code:
On a perfect world, if you execute the above code, you should not see anything on the console.
But, on a VM with CentOS 5.9 x86_64, I can see lots of messages, so it seems that System.nanoTime() is not reliable on that OS.
I've also tried the same code on CentOS 6.4 x86_64 and MacOS X Leopard Snow and it works fine (i.e, no output on the console).
The system that is experiencing this problem has the following details:
OpenNMS 1.10.6
Sun Java 1.6.0_17
Linux 2.6.18-164.11.1.0.1.el5 (amd64)
Based on the latest line, this is an older CentOS5 or RHEL5 system.
In order to test your system, create a file called Test.java with the above content. Then compile it by executing:
To execute the test:
If the program is displaying messages on your console, your Linux has a problem.
Alejandro Galue March 22, 2013 at 10:00 AM
I've attached the RRD with the wrong data generated and updated by the PSM. I'm not including the PSM configuration to protect the private information of the customer, and because it is irrelevant for this issue.
Also, I've attached a sample graph, and a script to generate the graph (with basically the same information from the graph template used in OpenNMS).
The second column of the graph is displaying the wrong data; I mean, the response time for "Login to the system" is greater than the response time of the total operation (displayed as a black line).
Here is the raw data extracted from the RRD using "rrdtool fetch":
Check out the second row (the row with 1363003200 for the timespan). The second column is greater than the last column; and the last column must be equal (or as equal as possible) as the sum of all other columns, because it is tracking the response time of the whole operation.
So, the problem might be related with the object used to track the time on each operation. Maybe there was a timeout on an individual test and that was not properly taken in consideration (but I'm guessing here).
This is related with customer ticket 2013:
https://mynms.opennms.com/Ticket/Display.html?id=2013
The customer wants to test an ordering system, by processing 12 web pages. It is required to store the individual response time from each page, as well as the total response time of the whole test.
The problem is that, in some cases (i.e., not always), the response time for certain pages is greater than the response time of the whole operation which is wrong by definition. This should not happen and that's the reason of this bug.
The consequences are that the graph is showing unexpected information with weird values (more on this later).