
Don't (necessarily) trust your tools

Posted by sdo on January 22, 2008 at 9:08 AM PST

I spent last week working with a customer in Phoenix (only a few weeks before the Giants go there to beat the Patriots), and one of the things we wanted to test was how their application would work with the new in-memory replication feature of the appserver. They brought along one of their apps; we installed it, ran their jmeter test, and quickly verified that in-memory session replication worked as expected in the face of a server failure.



Feeling confident about the functionality test, we did some performance testing using their jmeter script. We got quite good throughput from their test. But as we watched it run, we noticed jmeter reporting that the throughput was continually decreasing. Since we were pulling the plug on instances in our 6-node cluster all the time, at first I just chalked it up to that. But then we ran a test without failing instances, and the same thing happened: continually decreasing performance.



Nothing is quite as embarrassing as showing off your product to a customer and having the product behave badly. I was ready to blame a host of things: botched installation, network interference, phases of the moon. Secretly, I was willing to blame the customer app: if there's a bug, it must be in their code, not ours.



Eventually, we simplified the test down to a single instance, no failover, and a single URL to a simple JSP: pretty basic stuff, and yet it still showed degradation over time (in fact, things got worse). Now there were two things left to blame: jmeter, or the phases of the moon. Neither seemed likely, until I took a closer look at what jmeter was doing: it turns out that the jmeter script was using an Aggregate Report. That report, in addition to updating the throughput for each request, also updates various statistics, including the 90% response time. It does this in real time, which may seem like a good idea, but calculating the 90% response time is an O(n) operation: the more requests jmeter made, the longer it took to calculate the 90% time.
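
To see why that hurts, here's a minimal sketch (not JMeter's actual code, just an illustration of the pattern) of a report that recomputes the 90th-percentile response time on every sample. Each update has to walk the entire, ever-growing list of response times, so the per-request bookkeeping cost grows as the test runs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only -- not JMeter's implementation. Recomputing a percentile
// on every sample means touching the full history of response times each time,
// so the cost of reporting a single request grows with the number of samples.
public class NaivePercentileReport {
    private final List<Long> responseTimes = new ArrayList<>();

    // Called once per request. The copy-and-sort pass is at least O(n) over
    // all samples collected so far, so later requests pay a bigger penalty.
    public long addSampleAndGet90thPercentile(long responseTimeMillis) {
        responseTimes.add(responseTimeMillis);
        List<Long> sorted = new ArrayList<>(responseTimes);
        Collections.sort(sorted);
        int index = (int) Math.ceil(0.9 * sorted.size()) - 1;
        return sorted.get(Math.max(index, 0));
    }
}
```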



I've previously written in other contexts about why tests with 0 think time are subject to misleading results. And it turns out this is another case of that: because there is no think time in the jmeter script, the time to calculate the 90% penalizes the total throughput. As the time to calculate the 90% increases, the time available for jmeter to make requests decreases, and hence the reported throughput decreases over time.
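
A rough back-of-the-envelope simulation shows the effect. The numbers below are made up (a constant 5 ms server response time and a tiny per-sample reporting cost), but with zero think time that overhead comes straight out of the time available to send requests, so the reported throughput drifts downward minute by minute even though the server itself is steady.

```java
// Hypothetical numbers only: a constant server response time plus a reporting
// cost that grows with the number of samples collected. With zero think time,
// the reporting cost comes directly out of the time available to send requests,
// so the reported throughput falls even though the server never slows down.
public class ThroughputDecay {
    public static void main(String[] args) {
        final double serverMillis = 5.0;            // assumed server response time
        final double overheadNanosPerSample = 50.0; // assumed cost per stored sample
        long samples = 0;
        for (int minute = 1; minute <= 5; minute++) {
            double elapsedMillis = 0;
            long requests = 0;
            while (elapsedMillis < 60_000) {        // one simulated minute
                double reportMillis = samples * overheadNanosPerSample / 1_000_000.0;
                elapsedMillis += serverMillis + reportMillis;
                samples++;
                requests++;
            }
            System.out.printf("minute %d: %.0f requests/sec%n",
                    minute, requests / 60.0);
        }
    }
}
```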



I'm not actually sure if jmeter is smart enough to do this calculation correctly even if there is think time between requests: will it just blindly sleep for the think time, or will it correctly calculate the think time minus its own processing time? For my test, it doesn't matter: the simpler thing is to use a different reporting tool that doesn't have the 90% calculation (which, I'm happy to report, showed glassfish/SJSAS 9.1 performing quite well with in-memory replication across the cluster and no degradation over time).



But what's more important to me is that it reinforces a lesson that I seem to have to relearn a lot: sometimes, your intuition is smarter than your tools. I had a strong intuition from the beginning that the test was flawed, but despite that, we spent a fair amount of time tracking down possible bugs in glassfish or the servlets.



And I also don't mean to limit this to a discussion of this particular bug/design issue with jmeter. When we tested startup for the appserver, a particular engineer was convinced that glassfish was idle for most of its startup time: the UNIX time command reported that the elapsed time to run asadmin start-domain was 30 seconds, but the CPU time used was only 1 or 2 seconds. The conclusion from that was that glassfish sat idle for 28 seconds. But intuitively, we knew that wasn't true (for one thing, the disk was cranking away all that time, and a quick glance at a CPU meter would disprove the theory that the CPU wasn't being used). And of course, it turns out that asadmin was starting processes which started processes, and the shell's timing code didn't account for all of the descendant processes (particularly when intermediate processes exited but the grandchild process -- the appserver -- was still executing). The time command was just not suited to giving the desired answer.



Tools that give you visibility into your applications are invaluable; I'm not suggesting that when a tool gives you a result you don't expect, you should blindly cling to your hypothesis anyway. But when a tool and your intuition are in conflict, don't be afraid to examine the possibility that the tool isn't measuring what you wanted it to.


Comments

Oh, duh....yes, I might have mentioned that the other reporting tool in jmeter is the summary report.

Care to indicate what the other reporting tool was?

Good post and good to see that GlassFish performed well. As a note, your link to the "zero think time" blog post is broken.

Well, I'd much have preferred faban :-) But the people I was working with were familiar with jmeter and had already written the jmeter jmx files, and so that's what we used. As I came to the realization that the testing tool was the problem, I was prepared to switch to faban to prove that to myself (another good point: try multiple tools when data doesn't make sense), but then discovered the issue with jmeter and hence moved on.

hi scott,

since you've used faban earlier, I'm wondering why you didn't consider using it in this assignment?

thank you,

BR,
~A

Actually, tools can help you determine whether you are measuring what you wanted; you just need to reason about the observations and explore possibilities using the tool as a guide. This reminds me of a recent posting I made regarding a benchmark between two popular open source frameworks that was flawed on many counts:

Benchmark Analysis: Guice vs Spring

regards,
William