<?xml version="1.0" encoding="utf-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en">
<title>Scott Oaks&apos;s Blog</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/" />
<modified>2008-05-03T17:37:05Z</modified>
<tagline></tagline>
<id>tag:weblogs.java.net,2008:/blog/sdo/289</id>
<generator url="http://www.movabletype.org/" version="3.01D">Movable Type</generator>
<copyright>Copyright (c) 2008, sdo</copyright>
<entry>
<title>More on the simple vs. the complex</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2008/04/more_on_the_sim.html" />
<modified>2008-05-03T17:37:05Z</modified>
<issued>2008-04-03T17:04:04Z</issued>
<id>tag:weblogs.java.net,2008:/blog/sdo/289.9470</id>
<created>2008-04-03T17:04:04Z</created>
<summary type="text/plain">Is the speed of an appserver on a single request indicative of how it will handle your traffic?</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[<a href="http://weblogs.java.net/blog/sdo/archive/2008/04/what_does_it_me.html">Yesterday,</a> I wrote that I'm often asked which X is faster (for a variety of
X). The answer to that question always depends on your perspective. I answered that question in terms of hardware and concluded (as I always do) that the answer depends very much on your needs, but that a machine which appears slower in a single-threaded test will likely be faster in a multi-threaded world. You can't necessarily extrapolate results from a simple test to a complex system.
<br><br>
What about software? Today, I'll look at that question in terms of NIO. You're probably aware that the connection handler of glassfish is based on <a href="http://grizzly.dev.java.net">Grizzly,</a> an NIO framework. Yet in recent weeks, we've read claims from the Mailinator author that <a href="http://mailinator.blogspot.com/2008/02/kill-myth-please-nio-is-not-faster-than.html">
traditional I/O is faster than NIO.</a> And a recent
<a href="http://weblogs.java.net/blog/jdcampbell/archive/2008/02/top_java_5_ee_s_1.html">
blog</a>
from Jonathan Campbell shows traditional I/O-based appservers outperforming glassfish. So what gives?
<br><br>
Let's look more closely at the test Jonathan Campbell ran: even though it simulates multiple clients, the driver runs only a single request at a time. Though it doesn't appear so on the surface, this is exactly an NIO issue; it has to do with how you architect servers to handle single-request streams vs. a conversational stream. A little-known fact about glassfish is that it still contains a blocking, traditional I/O-based connector which is based on the Coyote connector from
Tomcat. You can enable it in glassfish by adding the -Dcom.sun.enterprise.web.connector.useCoyoteConnector=true option to your jvm-options -- but read this whole blog before you decide that using that connector is a good thing.
<br> <br>
So I enabled this connector, got out my two-CPU Linux machine running Red Hat AS 3.0, and re-ran the benchmark Jonathan ran on glassfish and jBoss (I tried Geronimo, but when it didn't work for me, I abandoned it -- I'm sure I'd just done
something stupidly wrong in running it, but I didn't have the time to look into
it). I ran each appserver with the same JVM options, but did no other tuning.
And now that we're comparing the blocking, traditional I/O connectors, Glassfish comes out well on top (and, by comparison with Jonathan's numbers, it would easily have beaten Geronimo as well):<br>
<img alt="jbench.png" src="http://weblogs.java.net/blog/sdo/archive/images/jbench.png" width="700" height="523" />
<br><br>
So does this mean that traditional I/O is faster than NIO? For this test, yes. But in general? Not necessarily. So next, I wrote up a little <a href="http://faban.sunsource.net">Faban Driver</a> that uses the same war file as the original test, but Faban will run the clients simultaneously instead of sequentially and continually pound on the same sessions. In my Faban test, I ran 100 clients, each of which had a 50 ms think time between repeated calls to the session validation servlet of the test. This gave me these calls per second:
<ul>
<li>Glassfish with NIO (grizzly): 8192
<li>Glassfish with Std IO: 3344
<li>jBoss: 6953
</ul>
Yes, those calls per second are vastly higher than the original benchmark -- the jRealBench driver is able to drive the CPU usage of my appserver machine only to about 15%. Faban can do better, though since the test is dominated by network
traffic, the CPU utilization is still only about 70%. And for glassfish's blocking connector, I had to increase the request-processing thread count to 100 (even so, there's probably something wrong with that result, but since the blocking connector is not really what we recommend you use in production, I'm not going to delve into it).
<br><br>
When scalability matters, NIO is faster than traditional blocking I/O. Which, of course, is why we use grizzly as the connector architecture for glassfish (and
why you probably should NOT run out and change your configuration to use the coyote connector, unless your appserver usage pattern is very much dominated by single request/response patterns). The complex is different than the simple.
<br><br>
As always, your mileage will vary -- but the point is, are there tests where traditional I/O is faster than NIO? Of course -- with NIO, you always have the overhead of a select() system call, so when you measure the individual path, traditional I/O will always be faster. But when you need to scale, then NIO will generally be faster; the overhead of the select() call is outweighed by having fewer thread context switches, or by having long keep-alive times, or other options that the architecture opens. Just as we saw with hardware, you can't necessarily extrapolate performance from the single, simple case to the complex system: you must test it to see how it behaves.
]]>

</content>
</entry>
<entry>
<title>What does it mean to be faster?</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2008/04/what_does_it_me.html" />
<modified>2008-05-03T17:37:42Z</modified>
<issued>2008-04-01T18:57:16Z</issued>
<id>tag:weblogs.java.net,2008:/blog/sdo/289.9456</id>
<created>2008-04-01T18:57:16Z</created>
<summary type="text/plain">If machine A does a simple test faster than machine B, is machine A the faster machine for your needs?</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[As a performance engineer, I'm often asked which X is faster (for a variety of X). The answer to that question always depends on your perspective.
<br><br>
Today, I'll talk about the answer in terms of hardware and application servers.
People quite often measure the performance of their appserver on, say, their laptop and a 6-core, 24-thread <a href="http://shop.sun.com/is-bin/INTERSHOP.enfinity/WFS/Sun_NorthAmerica-Sun_Store_US-Site/en_US/-/USD/ViewStandardCatalog-Browse?CategoryName=Sun_Fire_T1000_Se_1&CategoryDomainName=Sun_NorthAmerica-Sun_Store_US-SunCatalog">Sun Fire T1000</a>
and are surprised that the cheaper laptop can serve single requests much faster
than the more expensive server.
<br><br>
There are technical reasons for this that I won't delve into -- there are <a href="http://www.sun.com/servers/coolthreads/coolthreads_architecture_wp.pdf">architecture guides</a> that go into all that. Rather I want to explore the question
of which of these machines is actually faster, particularly in a Java EE context. In an appserver, you typically want to process multiple requests at the same time. So looking at the speed of a single request isn't really interesting: what
is the speed of multiple requests?
<br><br>
To answer this, I took a simple program that does a long-running nonsense calculation. Running this on my laptop and 24-thread T1000, I see the following times
(in seconds) to calculate X items:
<table>
 <tr>
   <th># Items</th>
   <th>Laptop</th>
   <th>T1000</th>
 </tr>
 <tr>
   <td>1</td>
   <td>.66</td>
   <td>1.3</td>
 </tr>
 <tr>
   <td>2</td>
   <td>1.4</td>
   <td>1.5</td>
 </tr>
 <tr>
   <td>4</td>
   <td>2.8</td>
   <td>1.6</td>
 </tr>
 <tr>
   <td>8</td>
   <td>5.4</td>
   <td>2.5</td>
 </tr>
 <tr>
   <td>16</td>
   <td>10.8</td>
   <td>3.7</td>
 </tr>
 <tr>
   <td>24</td>
   <td>16.6</td>
   <td>4.8</td>
 </tr>
</table>
As you'd expect, the performance of the laptop degrades linearly, to where it takes 16.6 seconds to perform 24 calculations. The performance of the T1000 isn't
a linear scale, but even though it takes twice as long as the laptop to perform
a single calculation, it can perform 24 calculations in less than one-third of the time of the laptop.
<br><br>
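To put the table in throughput terms, here is a small sketch of my own (the class and method names are made up for illustration) that divides items by seconds for each row. At 24 items the laptop manages about 1.4 items per second while the T1000 manages 5. It also reports the first measured row where the T1000 finishes first; the table has no row for 3 items, so the sketch can only say that the crossover falls somewhere after 2 items and no later than 4.

```java
// Throughput computed from the timing table above (times in seconds).
class ThroughputTable {
    static final int[] ITEMS = {1, 2, 4, 8, 16, 24};
    static final double[] LAPTOP = {0.66, 1.4, 2.8, 5.4, 10.8, 16.6};
    static final double[] T1000 = {1.3, 1.5, 1.6, 2.5, 3.7, 4.8};

    // Items completed per second for one row of the table.
    static double throughput(int items, double seconds) {
        return items / seconds;
    }

    // Smallest measured item count at which the T1000 finishes first,
    // or -1 if it never does.
    static int firstRowWhereT1000Wins() {
        for (int i = 0; i < ITEMS.length; i++) {
            if (T1000[i] < LAPTOP[i]) {
                return ITEMS[i];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int i = 0; i < ITEMS.length; i++) {
            System.out.printf("%2d items: laptop %.2f/s, T1000 %.2f/s%n",
                    ITEMS[i],
                    throughput(ITEMS[i], LAPTOP[i]),
                    throughput(ITEMS[i], T1000[i]));
        }
        System.out.println("T1000 first wins at "
                + firstRowWhereT1000Wins() + " items");
    }
}
```

The single-item row is the only one most quick tests ever look at, which is exactly why they mislead.
<br><br>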
In the context of an appserver, think of the calculation as the time required for the business methods of your app. I've walked through this explanation a number of times, and often I'm told that the business method is the critical part of
the app, and it must be done in .6 seconds for each user -- and hence the throughput of the T1000 isn't important. And that's fine: if you need to calculate a single method in .6 seconds, then you must use the single-threaded machine. But if you need to calculate two of those at the same time, then you'll need to get two of those machines, and if you need to calculate 24 of them, you'll need to get 24 machines.
<br><br>
So this brings us back to our question: which machine is faster? And it depends
on what you need. If you need to only do one calculation at a time, then the laptop is faster. If you need to do 3 or more calculations at the same time, then the T1000 is faster. Which is faster for you will depend on your application, your traffic model, and many other variables. As always, the best thing is to try your application, but if that's not feasible, be very careful about extrapolating whatever data you do have: you cannot simply extrapolate performance data from
a simple (single-threaded) model to a complex system.
]]>

</content>
</entry>
<entry>
<title>Oh, go ahead -- prematurely optimize</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2008/02/oh_go_ahead_pre.html" />
<modified>2008-05-03T17:38:12Z</modified>
<issued>2008-02-25T23:04:41Z</issued>
<id>tag:weblogs.java.net,2008:/blog/sdo/289.9266</id>
<created>2008-02-25T23:04:41Z</created>
<summary type="text/plain">Premature optimization is the root of all evil. Writing badly-performing code is even worse.</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[Recently, I've been reading an article entitled
<a href="http://www.acm.org/ubiquity/views/v7i24_fallacy.html">The Fallacy of Premature Optimization</a> by Randall Hyde. I urge everyone to go read the full article, but I can't help
summarizing some of it here -- it meshes so well with some of my conversations with developers
over the past few years.
<br><br>
Most people can quote the line "Premature optimization is the root of all evil" (which was
popularized by Donald Knuth, but originally comes from Tony Hoare). Unfortunately, I (and
apparently Mr. Hyde) come across too many developers who have taken this to mean that they
don't have to care about the performance of their code at all, or at least not until the code
is completed. This is just wrong.
<br><br>
To begin, the complete quote is actually
<pre>
We should forget about small efficiencies, say about 97% of the time: premature optimization
is the root of all evil.</pre>
I agree with the basic premise of what this says, and also with everything it does not say.
In particular, this quote is abused in three ways.
<br><br>
First, it is only talking about small efficiencies. If you're designing a multi-tier app
that uses the network a lot, you want to pay attention to the number of network calls you
make and the data involved in them. Network calls are a <b>large</b> inefficiency. And
not to pick on network calls -- experienced developers know what things are inefficient,
and know to program them carefully from the start.
<br><br>
Second, Hoare is saying (and Hyde and I agree) that you can safely ignore the small
inefficiencies 97% of the time. That means that you should pay attention to small
inefficiencies 1 out of every 33 lines of code you write.
<br><br>
Third, and only somewhat relatedly, this quote feeds into the perception that 80% of
the time an application spends will be in 20% of the code, so we don't have to worry about
our code's performance until we find out we're in the 80%.
<br><br>
I'll present one example from glassfish to highlight those last two points. One day, we
discovered that a particular test case for glassfish was bottlenecked on calls to Vector.size --
in particular, because of loops like this:
<pre>
Vector v;
for (int i = 0; i < v.size(); i++)
     process(v.get(i));
</pre>
This is a suboptimal way to process a vector, and one of the 3% of cases you need to pay
attention to. The key reason is the synchronization around Vector, which
turns out to be quite expensive when this loop is the hot loop in your program. I know,
you've been told that uncontended access to a synchronized block is almost free, but that's
also not quite true -- crossing a synchronization boundary means that the JVM must flush all
instance variables presently held in registers to main memory. The synchronization boundary
also prevents the JVM from performing certain optimizations, because it limits how the JVM
can re-order the code. So we got a big performance boost by re-writing this as
<pre>
ArrayList v;
for (int i = 0, j = v.size(); i < j; i++)
     process(v.get(i));
</pre>
Perhaps you're thinking that we needed to use a vector because of threading issues, but
look at that first loop again: it is not threadsafe. If this code is accessed by multiple
threads, then it's buggy in both cases.
<br><br>
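Here is a self-contained sketch of both cases, written for this blog rather than taken from the glassfish source. The process method is a hypothetical stand-in for whatever the loop body does. The first method is the hoisted-size loop over an unsynchronized list; the second holds one lock for the whole traversal, which is what you would actually need if other threads could modify the list (and it only works if those other threads synchronize on the same list object).

```java
import java.util.ArrayList;
import java.util.List;

class LoopSketch {
    // Hypothetical stand-in for the process() call in the loops above.
    static long process(Integer value) {
        return value;
    }

    // Single-threaded case: unsynchronized list, size read only once.
    static long sumUnshared(ArrayList<Integer> v) {
        long total = 0;
        for (int i = 0, j = v.size(); i < j; i++) {
            total += process(v.get(i));
        }
        return total;
    }

    // Shared case: one lock held across the whole traversal, so the list
    // cannot change between the size check and the get. Writers must
    // synchronize on the same list object for this to be safe.
    static long sumShared(List<Integer> v) {
        long total = 0;
        synchronized (v) {
            for (int i = 0, j = v.size(); i < j; i++) {
                total += process(v.get(i));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        ArrayList<Integer> v = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            v.add(i);
        }
        System.out.println(sumUnshared(v) + " " + sumShared(v));
    }
}
```

Either way the lock is taken at most once per traversal instead of twice per element, which is where the Vector version loses its time.
<br><br>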
What about that 80/20 rule? It's true that we found this case because it was consuming a lot
(not 80%, but still a lot) of time in our program. [Which also means that fixing this case
is tardy optimization, but there it is.]
But the problem is that there wasn't just
one loop written like this in the code; there were (and still are...sigh) hundreds. We
fixed the few that were the worst offenders, but there are still many, many places in the
code where this construct lives on. It's considered "too hard" to go change all the places
where this occurs (though NetBeans could refactor it all pretty quickly, but there's a
risk that subtle differences in the loop would mean that it would need to be refactored
differently).
<br><br>
When we addressed performance in Glassfish V2 in order to get our <a href="http://weblogs.java.net/blog/sdo/archive/2007/07/sjsas_91_slassf_1.html">excellent SPECjAppServer results</a>,
we fixed a lot of little things like this, because we spend 80% of our time in about 50% of
our code. It's what I call performance death by a thousand cuts: it's great when you can
find a simple CPU-intensive set of code to optimize. But it's even better if developers
pay some attention to writing good, performant code at the outset and you don't have to
track down hundreds of small things to fix.
<br><br>
Hyde's <a href="http://www.acm.org/ubiquity/views/v7i24_fallacy.html">full article</a>
has some excellent references for further reading, as well as other important points about
why, in fact, paying attention to performance as you're developing is a necessary part of
coding.]]>

</content>
</entry>
<entry>
<title>Performance Stat of the Day</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2008/02/performance_sta.html" />
<modified>2008-02-04T03:12:46Z</modified>
<issued>2008-02-04T03:12:34Z</issued>
<id>tag:weblogs.java.net,2008:/blog/sdo/289.9125</id>
<created>2008-02-04T03:12:34Z</created>
<summary type="text/plain">It&apos;s impossible to tell performance without measuring</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
I&apos;ve written several times before about how you have to measure performance to understand how you&apos;re doing -- and so here&apos;s my favorite performance stat of the day: New York 17, New England 14.



</content>
</entry>
<entry>
<title>Don&apos;t (necessarily) trust your tools</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2008/01/dont_necessaril.html" />
<modified>2008-01-23T15:00:28Z</modified>
<issued>2008-01-22T17:08:24Z</issued>
<id>tag:weblogs.java.net,2008:/blog/sdo/289.9035</id>
<created>2008-01-22T17:08:24Z</created>
<summary type="text/plain">jmeter leads us down a blind alley -- should we have known better?</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[I spent last week working with a customer in Phoenix (only a few weeks before the Giants go there to beat the Patriots), and one of the things we wanted to test was how their application would work with the new in-memory replication feature of the appserver. They brought along one of their apps, we installed it and used their jmeter test, and quickly verified that the in-memory session replication worked as expected in the face of a server failure.
<br><br>
Feeling confident about the functionality test, we did some performance testing using their jmeter script. We got quite good throughput from their test. But as we watched it run, we noticed jmeter reporting that the throughput kept continually decreasing. Since we were pulling the plug on instances in our 6-node cluster all the time, at first I just chalked it up to that. But then we ran a test without failing instances, and the same thing happened: continually decreasing performance.
<br><br>
Nothing is quite as embarrassing as showing off your product to a customer and having the product behave badly. I was ready to blame a host of things: botched installation, network interference, phases of the moon. Secretly, I was willing to blame the customer app: if there's a bug, it must be in their code, not ours.
<br><br>
Eventually, we simplified the test down to a single instance, no failover, and a single URL to a simple JSP: pretty basic stuff, and yet it still showed degradation over time (in fact, things got worse). Now there were two things left to blame: jmeter, or the phases of the moon. Neither seemed likely, until I took a closer look at what jmeter was doing: it turns out that the jmeter script was using an Aggregate Report. That report, in addition to updating the throughput for each request, also updates various statistics, including the 90% response time. It does this in real-time, which may seem like a good idea: but the problem is that calculating the 90% response time is an O(n) operation: the more requests jmeter made, the longer it took to calculate the 90% time.
<br><br>
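To see why the cost grows, here is a sketch of the two obvious ways to track a 90% response time. This is my own illustration, not jmeter's code, and every name in it is made up. The exact version copies and sorts every sample seen so far (so it is even worse than O(n) per call); the bucketed version keeps one counter per millisecond up to an assumed 10-second cap, so recording a sample is constant time no matter how long the test has run.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PercentileSketch {
    // Exact 90th percentile: copies and sorts all samples seen so far, so
    // calling it after every request costs more as the run gets longer.
    static long exactP90(List<Long> samplesMs) {
        if (samplesMs.isEmpty()) {
            return 0;
        }
        List<Long> sorted = new ArrayList<>(samplesMs);
        Collections.sort(sorted);
        int rank = (9 * sorted.size() + 9) / 10; // ceil(0.9 * n)
        return sorted.get(rank - 1);
    }

    // Approximate alternative: one counter per millisecond, capped at
    // MAX_MS (an assumed upper bound; slower samples land in the last slot).
    static final int MAX_MS = 10_000;
    private final long[] buckets = new long[MAX_MS + 1];
    private long count;

    // Constant-time update, independent of how many samples came before.
    void record(long ms) {
        int slot = (int) Math.min(Math.max(ms, 0), MAX_MS);
        buckets[slot]++;
        count++;
    }

    // Scans a fixed number of buckets, however many samples were recorded.
    long approxP90() {
        if (count == 0) {
            return 0;
        }
        long target = (9 * count + 9) / 10; // ceil(0.9 * count)
        long seen = 0;
        for (int ms = 0; ms <= MAX_MS; ms++) {
            seen += buckets[ms];
            if (seen >= target) {
                return ms;
            }
        }
        return MAX_MS;
    }

    public static void main(String[] args) {
        PercentileSketch p = new PercentileSketch();
        List<Long> samples = new ArrayList<>();
        for (long ms = 1; ms <= 100; ms++) {
            samples.add(ms);
            p.record(ms);
        }
        System.out.println(exactP90(samples) + " " + p.approxP90());
    }
}
```

The bucketed version trades precision (one-millisecond resolution, capped range) for a fixed cost, which is the trade you want in a load driver that isn't supposed to be the bottleneck.
<br><br>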
I've previously written in other contexts about why tests with 0 think time are <a href="http://weblogs.java.net/blog/sdo/archive/2007/05/how_to_test_con.html">subject to misleading results.</a> And it turns out this is another case of that: because there is no think time in the jmeter script, the time to calculate the 90% penalizes the total throughput. As the time to calculate the 90% increases, the time available for jmeter to make requests decreases, and hence the reported throughput decreases over time.
<br><br>
I'm not actually sure if jmeter is smart enough to do this calculation correctly even if there is think time between requests: will it just blindly sleep for the think time, or will it correctly calculate the think time minus its own processing time? For my test, it doesn't matter: the simpler thing is to use a different reporting tool that doesn't have the 90% calculation (which, I'm happy to report, showed glassfish/SJSAS 9.1 performing quite well with in-memory replication across the cluster and no degradation over time).
<br><br>
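The correct calculation I have in mind is small enough to sketch. This is a hypothetical helper of mine, not anything from jmeter: a driver that wants a fixed pace should sleep for the think time minus whatever it spent on its own bookkeeping, and never for a negative amount.

```java
class ThinkTimePacer {
    // Time left to sleep once the driver's own bookkeeping time has been
    // subtracted from the configured think time; never negative.
    static long remainingSleepMs(long thinkTimeMs, long overheadMs) {
        return Math.max(0, thinkTimeMs - overheadMs);
    }

    public static void main(String[] args) {
        // 50 ms think time, 12 ms spent updating statistics
        System.out.println(remainingSleepMs(50, 12));
        // the overhead has outgrown the think time entirely
        System.out.println(remainingSleepMs(50, 80));
    }
}
```

Once the overhead exceeds the think time, as in the second call, no amount of careful sleeping helps: the driver itself has become the limit on throughput.
<br><br>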
But what's more important to me is that it reinforces a lesson that I seem to have to relearn a lot: sometimes, your intuition is smarter than your tools. I had a strong intuition from the beginning that the test was flawed, but despite that, we spent a fair amount of time tracking down possible bugs in glassfish or the servlets.
<br><br>
And I also don't mean to limit this to a discussion of this particular bug/design issue with jmeter. When we tested startup for the appserver, a particular engineer was convinced that glassfish was idle for most of its startup time: the UNIX time command reported that the elapsed time to run asadmin start-domain was 30 seconds, but the CPU time used was only 1 or 2 seconds. The conclusion from that was that glassfish sat idle for 28 seconds. But intuitively, we knew that wasn't true (for one thing, the disk was cranking away all that time, and a quick glance at a CPU meter would disprove the theory that the CPU wasn't being used). And of course, it turns out that asadmin was starting processes which started processes, and shell timing code didn't understand all the descendant structure (particularly when intermediate processes exited but the grandchild process -- the appserver -- was still executing). The time command was just not suited to giving the desired answer.
<br><br>
Tools that give you visibility into your applications are invaluable; I'm not suggesting that when a tool gives you a result that you don't expect that you should blindly cling to your hypothesis anyway. But when a tool and your intuition are in conflict, don't be afraid to examine the possibility that the tool isn't measuring what you wanted it to.]]>

</content>
</entry>
<entry>
<title>Grizzly Protocol Parsers</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/12/grizzly_protoco.html" />
<modified>2008-06-24T19:17:03Z</modified>
<issued>2007-12-19T21:01:04Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.8866</id>
<created>2007-12-19T21:01:04Z</created>
<summary type="text/plain">A quick overview of grizzly 1.7&apos;s new protocol parsing paradigm.</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Open Source</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[[NOTE: The code in this blog was revised 2/11/08 due to some errors on my part the first time, and some changes as it was integrated into grizzly. And thanks to Erik Svensson for pointing out a few errors, it has been revised again on 2/13/08.]<br>
I'm quite interested these days in parsing performance: much of what a Java
appserver does is take bytes from a network stream (usually, but not
always, in some 8-bit encoding) and convert them into Java strings (based
on 16-bit characters). Because servlet and JSP APIs are written in terms of
strings, much of that conversion is unavoidable, but parsing network
protocols at the byte level is appropriate in some circumstances.
<br><br>
As I prepared to prototype some tests around that, I realized I needed a
good framework to test my changes, and of course that framework is
<a href="https://grizzly.dev.java.net">grizzly.</a>
In fact, the newly-released <a href="http://weblogs.java.net/blog/jfarcand/archive/2007/12/hohohoho_grizzl_1.html">
grizzly 1.7</a> has a new protocol parser that exactly fit my needs
(partly because I joined the grizzly project so that I could modify the
parser as I needed; such are the joys of open source!).
<br><br>
I'll talk about some of my performance tests with network parsing in later
blogs; for now, I wanted to write a quick entry on how to use grizzly 1.7's
new protocol parser. In grizzly 1.7, the ProtocolParser interface was
reimplemented to make it much easier to deal with the messages that the
parser is expected to produce. This means that it is now possible to use
standard grizzly filters to handle the data produced by a ProtocolParser,
simply like this:
<pre>
controller.setProtocolChainInstanceHandler(new DefaultProtocolChainInstanceHandler() {
      public ProtocolChain poll() {
          ProtocolChain protocolChain = protocolChains.poll();

          if (protocolChain == null) {
              protocolChain = new DefaultProtocolChain();
              ((DefaultProtocolChain) protocolChain).setContinuousExecution(true);
              protocolChain.addFilter(new MyParserProtocolFilter());
              protocolChain.addFilter(new MyProcessorFilter());
          }

          return protocolChain;
      }
});
</pre>
The nice thing about this is that additional filters (like a debugging log filter)
can be inserted anywhere along the chain; the protocol use is completely
integrated into the standard grizzly design. Note the call to setContinuousExecution -- it should be the default for protocol parsers (and will be eventually), but version 1.7 of grizzly will need that call. [Note that the standard LogFilter in grizzly is not appropriate in this case, since it tries to read directly from the socket as well; it's trivial to write your own if you like.]
<br><br>
Now it's a matter of implementing the two filters and the parser itself.
The ParserProtocolFilter class will handle reading the requests and calling
the parser, but in order for it to know which parser to use, you must
extend it and override the newProtocolParser method:
<pre>
public class MyParserProtocolFilter extends ParserProtocolFilter {
    public ProtocolParser newProtocolParser() {
         return new MyProtocolParser();
    }
}
</pre>
What about the parser itself? That's the meat of the issue. The new
protocol parser interface expects a basic flow like this: start processing
a buffer, enumerate the messages in the buffer, and end processing the
buffer. The buffer can contain 0 or more complete messages, and it's up to
the protocol parser to make sense of that. Here's the outline of
a simple protocol parser
that parses a protocol where the first byte is the number of bytes in the string,
followed by the bytes of the string itself:
<pre>
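public class MyProtocolParser implements ProtocolParser<String> {
    byte[] data;             // backing array of the buffer being parsed
    int position;            // next unparsed index into data
    int limit;               // end of valid bytes in data
    boolean partial;         // true if an incomplete message remains
    String savedString;      // the most recently parsed message
    ByteBuffer savedBuffer;
    int origLimit;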
    public void startBuffer(ByteBuffer bb) {
        // We begin with a buffer containing data. Save the initial buffer
        // state information. The best thing here is to get the backing store
        // so that the bytes can be parsed directly. We also need to save the
        // original limit so that we can place the buffer in the correct state at the
        // end of parsing
            savedBuffer = bb;
            savedBuffer.flip();
            partial = false;
            origLimit = savedBuffer.limit();
            if (savedBuffer.hasArray()) {
                data = savedBuffer.array();
                position = savedBuffer.position() + savedBuffer.arrayOffset();
                limit = savedBuffer.limit() + savedBuffer.arrayOffset();
            } else ...maybe copy out the data, or use put/get when parsing...
    }

    public boolean hasMoreBytesToParse() {
        // Indicate if there is unparsed data in the buffer
        return position < limit;
    }

    public boolean isExpectingMoreData() {
        // If there is a partial message remaining in the buffer, return true
        return partial;
    }

    public String getNextMessage() {
        // We already know this, but other protocols might parse here
        return savedString;
    }

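    public boolean hasNextMessage() {
        // In our case, it's easier to parse here
        if (position >= limit) {
            // Nothing left to parse, so there is no partial message either
            partial = false;
            return false;
        }
        // The first byte is the (unsigned) length of the string after it
        int length = data[position] & 0xFF;
        int start = position + 1;
        if (start + length <= limit) {
            savedString = new String(data, start, length);
            position = start + length;
            // Expose just this message to downstream filters; set the
            // limit first so that the new position is always valid
            int offset = savedBuffer.arrayOffset();
            savedBuffer.limit(position - offset);
            savedBuffer.position(start - offset);
            partial = false;
        }
        else partial = true;
        return !partial;
    }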

    public boolean releaseBuffer() {
        // If there's a partial message return true; else false
            if (!hasMoreBytesToParse())
                savedBuffer.clear();
            else {
                // You could compact the buffer here if you're
                // concerned that there isn't enough space for
                // further messages, but compacting comes at a
                // performance price -- whether to compact or not
                // depends on your protocol.
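                // position indexes the backing array, so convert it back
                // to a buffer index; restore the limit first
                savedBuffer.limit(origLimit);
                savedBuffer.position(position - savedBuffer.arrayOffset());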
            }
            return partial;
    }
}
</pre>
The point of this is that the ParserProtocolFilter will repeatedly call
hasNextMessage/getNextMessage to retrieve messages (Strings in this case)
to pass to the next filter. When it's done, it will call releaseBuffer,
which is responsible for setting the position and limit in the buffer to
reflect the data consumed by the (possibly multiple) messages returned.
<br><br>
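If you want to play with the framing logic outside of grizzly, here is a stand-alone sketch that uses nothing but java.nio. It is not the grizzly API; the class and method names are made up, and I've assumed the strings are US-ASCII. It pulls every complete length-prefixed message out of a flipped buffer and leaves the buffer positioned at the start of any incomplete trailing message, which is the same contract the parser above is trying to honor.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LengthPrefixedFraming {
    // Extracts every complete message from a buffer that is ready for
    // reading (already flipped). On return, the buffer's position sits at
    // the first byte of any incomplete trailing message.
    static List<String> drain(ByteBuffer bb) {
        List<String> messages = new ArrayList<>();
        while (bb.hasRemaining()) {
            // Peek at the unsigned length byte without consuming it
            int length = bb.get(bb.position()) & 0xFF;
            if (bb.remaining() < 1 + length) {
                break; // partial message: wait for more data
            }
            bb.get(); // consume the length byte
            byte[] body = new byte[length];
            bb.get(body);
            messages.add(new String(body, StandardCharsets.US_ASCII));
        }
        return messages;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(32);
        bb.put((byte) 2).put("hi".getBytes(StandardCharsets.US_ASCII));
        // A second message that claims 5 bytes but has only 3 so far
        bb.put((byte) 5).put("wor".getBytes(StandardCharsets.US_ASCII));
        bb.flip();
        System.out.println(drain(bb) + ", unparsed bytes: " + bb.remaining());
    }
}
```

Using get/put on the buffer like this is simpler than working on the backing array, but it is also the slower path that the hasArray() check in the parser is there to avoid.
<br><br>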
So what about the downstream filters? You probably noticed that when we
parsed the data, we also set the limit/position in the ByteBuffer to
reflect the message boundaries. That's because not all grizzly filters will
understand that the data is protocol based and has been separated into
types. For instance, you could write a LogFilter that just prints out the data received;
it doesn't know about the messages (and we wouldn't want it to -- we'd want
it to print the raw data anyway, rather than information in the message).
<p>
But downstream filters can also understand what a message is and hence they
can work like this:
<pre>
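public class MyProcessorFilter implements ProtocolFilter {
    public boolean execute(Context ctx) throws IOException {
        String s = (String) ctx.getAttribute(ProtocolParser.MESSAGE);
        if (s == null) {
            // no message; just use the bytes in the buffer like a
            // normal filter (getStringFromBuffer is your own helper)
            s = getStringFromBuffer(ctx);
        }
        // .. do something with s ..
        return true;    // continue with the next filter in the chain
    }

    public boolean postExecute(Context ctx) throws IOException {
        return true;
    }
}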
So, apart from writing the protocol parser (which could be quite complex,
depending on the actual protocol and how it breaks into messages), using
the new grizzly framework for protocol parsing is quite simple: you just
set up the parser class, and then have a filter that processes the messages
from the parser. And along the way, you can use any other grizzly filter or
framework feature you need.
]]>

</content>
</entry>
<entry>
<title>A Glassfish Tuning Primer</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/12/a_glassfish_tun.html" />
<modified>2008-05-25T20:23:59Z</modified>
<issued>2007-12-03T19:25:33Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.8749</id>
<created>2007-12-03T19:25:33Z</created>
<summary type="text/plain">Everything (almost) you wanted to know about tuning glassfish without reading the manual.</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[When I reported our <a href="http://weblogs.java.net/blog/sdo/archive/2007/11/a_scalable_spec.html">recent excellent SPECjAppServer 2004 scores,</a> one glassfish user responded:
<pre>
I sure wish you guys were able to come up with a thorough write up
about the SPEC Benchmark architecture, and the techniques you guys
used to get the numbers you get and, more importantly, how those
techniques might apply to every day applications we run in the wild.
</pre>
While we do have a <a href="http://docs.sun.com/app/docs/doc/819-3681">full performance-tuning chapter</a> in the glassfish/SJSAS docset, I can understand the appeal of a quick cheat-sheet for getting the most out of glassfish in production. Most of this information has appeared in various blogs, particularly by <a href="http://weblogs.java.net/blog/jfarcand/">Jeanfrancois,</a> who is so expertly focused on making sure that grizzly and our http path is as fast as possible. Still, I hope that gathering this quick list together will be a good single-source summary.
<br><br>
One thing to note about these guidelines: a lot of glassfish configurations (particularly when you start with a developer profile) are optimized for developers. In development, performance is different: you'll trade off a few seconds here and there to make starting the appserver faster, or deploying something faster. In production, you'll make opposite trade-offs. So if you wonder why some of the things in this list aren't necessarily the default setting, that's probably why.
<h3>Tune your JVM</h3>
The first step is to tune the JVM, which is of course different for every deployment. These are the options set via the jvm-options tag in your domain.xml (or the JVM options page in the admin console). As a general rule, I like to use the throughput collector with large heaps and a moderate-sized young generation: that makes young GCs quite fast. That will lead to a periodic full GC, but the impact of that on total throughput is usually quite minimal. If you absolutely cannot tolerate a pause of a few seconds, you can look at the concurrent collector, but be aware that this will impact your total throughput. So a good set of JVM arguments to start with is:
<pre>
-server -Xmx3500m -Xms3500m -Xmn1500m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+AggressiveOpts</pre>
On a CMT machine like the SunFire T5220 server, you'll want to use large pages of 256m, and a heap that is a multiple of that:
<pre>
-server -XX:LargePageSizeInBytes=256m -Xmx2560m -Xms2560m -Xmn1024m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=16 -XX:+AggressiveOpts
</pre>
More details of the impact of a CMT machine are available at <a href="http://www.sun.com/coolthreads/tnb/applications.jsp">Sun's Cool Threads website.</a>
<br><br>
Make sure to remove the -client option from your jvm options, to include the -Dcom.sun.enterprise.server.ss.ASQuickStartup=false flag, and -- if you are using CMP 2.1 entity beans -- to include -DAllowMediatedWriteInDefaultFetchGroup=true.
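<br><br>
One way to confirm that the options actually reached the server JVM is to ask the JVM itself. The following is only a sketch: the class name and the list of expected options are illustrative assumptions, and you would run the check from code deployed inside the appserver (a test servlet, for example) rather than as a standalone program.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of glassfish: reports which expected JVM
// options are absent from the arguments the running JVM was started with.
class JvmOptionCheck {
    // Returns every expected option that does not appear in the actual list.
    static List<String> missingOptions(List<String> actual, List<String> expected) {
        List<String> missing = new ArrayList<>();
        for (String option : expected) {
            if (!actual.contains(option)) {
                missing.add(option);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The arguments this JVM was launched with (excluding the main class arguments).
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        // Assumed list for illustration: adjust it to the options you configured.
        List<String> expected = Arrays.asList(
                "-Xmx3500m", "-Xms3500m", "-Xmn1500m", "-XX:+UseParallelGC");
        System.out.println("JVM arguments: " + actual);
        System.out.println("Missing options: " + missingOptions(actual, expected));
        System.out.println("Max heap (MB): " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```

An empty "Missing options" list suggests the domain.xml changes were picked up; anything listed there is worth checking against the admin console.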
<h3>Tune the default-web.xml</h3>
Settings in the default-web.xml file are overridden by an application's web.xml, but I find it easier to set production-ready values in the default-web.xml file so that all applications will get them. In particular, under the JspServlet definition, add these two parameters:
<pre>
&lt;init-param>
  &lt;param-name>development&lt;/param-name>
  &lt;param-value>false&lt;/param-value>
&lt;/init-param>
&lt;init-param>
  &lt;param-name>genStrAsCharArray&lt;/param-name>
  &lt;param-value>true&lt;/param-value>
&lt;/init-param>
</pre>
That will mean you cannot change JSP pages on your production server without redeploying the application, but that's generally what you want anyway.
<br><br>
One note about this: this file is only consulted when an application is deployed. So make sure you change the file and then deploy your application, or you won't see any benefit from this change.
<h3>Tune the HTTP threads</h3>
As you know, there are two parameters here: the HTTP acceptor threads, and the request-processing threads. These values have unfortunately had different meanings in a few of our releases, and some confusion about them remains. The acceptor threads are used both to accept new connections to the server and to schedule existing connections when a new request comes over them. In general, you'll need 1 of these for every 1-4 cores on your machine; no more than that (unlike, say, SJSAS 8.1, where this had a completely different meaning). The request threads run HTTP requests. You want "just enough" of those: enough to keep the machine busy, but not so many that they compete for CPU resources -- if they compete for CPU resources, then your throughput will suffer greatly. Too many request-processing threads is a big performance problem I see on many machines.
<br><br>
How many is "just enough"? It depends, of course -- in a case where HTTP requests don't use any external resource and are hence CPU bound, you want only as many HTTP request processing threads as you have CPUs on the machine. But if the HTTP request makes a database call (even indirectly, such as by using a JPA entity), the request will block while waiting for the database, and you could profitably run another thread. So this takes some trial and error, but start with the same number of threads as you have CPUs and increase them until you no longer see an improvement in throughput.
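<br><br>
The trial-and-error loop can be written down as a small rule: keep raising the thread count while each step still buys a meaningful gain, and stop at the first step that doesn't. The sketch below assumes you have already measured throughput at several thread counts; the class name, the 5% cutoff, and every number in it are made up purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a glassfish API: picks a request-thread count
// from throughput measurements taken at increasing thread counts.
class RequestThreadTuning {
    // Walks the measurements in insertion order and returns the last thread
    // count that improved throughput by at least minGain (0.05 means 5%).
    // Returns -1 if there are no measurements.
    static int pickThreadCount(Map<Integer, Double> throughputByThreads, double minGain) {
        int best = -1;
        double bestThroughput = 0.0;
        for (Map.Entry<Integer, Double> e : throughputByThreads.entrySet()) {
            if (best < 0 || e.getValue() >= bestThroughput * (1.0 + minGain)) {
                best = e.getKey();
                bestThroughput = e.getValue();
            } else {
                break; // no meaningful improvement: stop adding threads
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("Start with " + cpus + " request-processing threads");

        // Made-up measurements (requests/sec), for illustration only.
        Map<Integer, Double> measured = new LinkedHashMap<>();
        measured.put(8, 1000.0);
        measured.put(12, 1300.0);
        measured.put(16, 1450.0);
        measured.put(24, 1460.0);
        System.out.println("Settle on " + pickThreadCount(measured, 0.05) + " threads");
    }
}
```

With these invented numbers the rule settles on 16 threads, because going from 16 to 24 gains less than 5%. The real measurements have to come from running your own application under load.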
<h3>Tune your JDBC drivers</h3>
Speaking of databases, it's quite important in glassfish to use JDBC drivers that perform statement caching; this allows the appserver to reuse prepared statements and is a huge performance win. The JDBC drivers that come bundled with the Sun Java System Application Server provide such caching; Oracle's standard JDBC drivers do as well, as do recent drivers for Postgres and MySQL. Whichever driver you use, make sure to configure the properties to use statement caching when you set up the JDBC connection pool -- e.g., for Oracle's JDBC drivers, include the properties
<pre>
ImplicitCachingEnabled=true
MaxStatements=200
</pre>
<h3>Use the HTTP file cache</h3>
If you serve a lot of static content, make sure to enable the HTTP file cache.
<br><br><br><br>
Have I piqued your interest? As I mentioned, there are hundreds of pages of tuning guidelines in our docset. But here at least you have some important first steps.]]>

</content>
</entry>
<entry>
<title>A scalable SPECjAppServer 2004 submission</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/11/a_scalable_spec.html" />
<modified>2008-06-24T19:17:03Z</modified>
<issued>2007-11-26T17:16:45Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.8707</id>
<created>2007-11-26T17:16:45Z</created>
<summary type="text/plain">Sun has submitted a SPECjAppServer 2004 submission that scales across a lot of hardware. Is it just a question of throwing hardware at the problem?</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[Last week, Sun published a new SPECjAppServer 2004 benchmark score: 8439.36 JOPS@Standard [1]. [I'd have written about it sooner, but it was published late Wednesday, and I had to go home and bake a lot of pies.] This is a "big" number, and frankly, it's the one thing that's been missing in our repertoire of submissions. We'd previously shown leading performance on a single chip, but workloads in general (and SPECjAppServer 2004 in particular) don't scale linearly as you increase the load. This number shows that we can scale our appserver across multiple nodes and machines quite well.
<br><br>
I've been asked quite a lot about what scalability actually means for this workload, so let me talk about Java EE scalability for a little bit. The first question I'm invariably asked is, isn't this just a case of throwing lots more hardware at the problem? Clearly, at a certain level the answer is yes: you can't do more work without more hardware. And I don't want to minimize the importance of the amount of hardware that you throw at the problem. There are presently two published SPECjAppServer scores that are higher than ours: HP/Oracle have results of 9459.19 JOPS@Standard [2] and 10519.43 JOPS@Standard [3]. Yet those results require 11 and 12 (respectively) appserver tier machines; our result uses only 6 appserver tier machines. More telling is that the database machine in our submission is a pretty beefy Sun Fire E6900 with 24 CPUs and 96GB of memory. Pretty beefy, that is, until you look at the HP/Oracle submissions that rely on 40 CPUs and 327GB of memory in two Superdome chassis. So yes, if you have millions (and I mean many millions -- ask your HP rep how much those two Superdomes will cost) of dollars to throw at the hardware, you can expect to get a quite high number on the benchmark.
<br><br>
The database, in fact, is one reason why most Java EE benchmarks (and workloads) will not scale linearly -- you can horizontally scale appserver tiers pretty well, but there is still only a single database that must handle an increasing load.
<br><br>
On the appserver side, horizontal scaling is not quite just a matter of throwing more hardware at the problem. SPECjAppServer 2004 is partitioned quite nicely: no failover between J2EE instances is required, connections to a particular instance are sticky, and the instances don't need to communicate with each other. All of that leads to quite nice linear scaling.
<br><br>
But one part of the benchmark doesn't scale linearly, because it is dependent on the size of the database. SPECjAppServer 2004 uses a bigger database for bigger configurations. For example, our previous submission on a single SunFire T2000 achieved a score of 883.66 JOPS@Standard [4]. The benchmark sizing rules meant that the database used for that configuration was only 10% as large as the database we used in our current submission. [More reason why that database scaling is important.] And in particular, it meant that the database in the small submission held 6000 items in the O_item table while our current submission had 60000 items in that table.
<br><br>
For SPECjAppServer 2004, that's important because the benchmark allows the appserver to cache that particular data in read-only, container-managed EJB 2.1 entities. [That's a feature that's explicitly outside of the J2EE 1.3/1.4 specification, so your portable J2EE apps won't use it -- your portable Java EE 5 apps that use JPA can use cached database data, though somewhat differently.] Caching 6K items is something a single instance can do, but caching all 60K items will cause GC issues for the appserver. Hence, in some areas, the appserver will have to do more work as the database size increases, even if the total load per appserver instance is the same.
<br><br>
So a "big" score on this benchmark is a factor of two things: there are things within the appserver architecture that influence how well you will scale, even in a well-partitioned app. But the amount of hardware (and cost of that hardware) remains the key driving factor in just how high that score can go. As I've stressed many times, benchmarks like this are a proof-point: our previous numbers establish that we have quite excellent performance, and this number establishes that we can scale quite well. As always, the only relevant test remains your application: download the appserver now and see how well it responds to your requirements.
<br><br>
Finally, as always, some disclosures: SPEC and the benchmark name SPECjAppServer 2004 are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of 11/26/07. For the latest SPECjAppServer 2004 benchmark results, visit http://www.spec.org/. Referenced scores:<br>
[1] Six Sun SPARC Enterprise T5120 (6 chip, 48 cores) appservers and one Sun Fire E6900 (24 chips, 48 cores) database; 8,439.36 JOPS@Standard<br>
[2] Eleven HP BL860c (22 chips, 44 cores) appservers and two HP Superdomes (40 chips, 80 cores) database; 9,459.19 JOPS@Standard<br>
[3] Twelve HP BL860c (24 chips, 48 cores) appservers and two HP Superdomes (40 chips, 80 cores) database; 10,519.43 JOPS@Standard<br>
[4] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 883.66 SPECjAppServer2004 JOPS@Standard]]>

</content>
</entry>
<entry>
<title>Sun Ships Glassfish V2</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/09/sun_ships_glass.html" />
<modified>2008-05-25T20:25:10Z</modified>
<issued>2007-09-17T19:59:21Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.8257</id>
<created>2007-09-17T19:59:21Z</created>
<summary type="text/plain">Sun ships glassfish V2, and the performance team no longer has two heads.</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[As you've probably read by now, today Sun released the product version of <a href="http://blogs.sun.com/theaquarium/entry/glassfish_v2_launch_roundup">Glassfish V2</a>, or the Sun Java System Application Server 9.1. So it's time to look back a little, but also to look forward.
<br><br>
When we planned for this release of software, I told our engineering managers that we wanted to make sure that this release, finally, had superior performance. My group received a lot of support for this position, though not a few engineering managers looked at me as if I had two heads. But as you know, we achieved superior <a href="http://weblogs.java.net/blog/sdo/archive/2007/07/sjsas_91_glassf_1.html">SPECjAppServer 2004</a> scores with this software, so I am once again viewed as a mono-headed creature.
<br><br>
Of course, we focused on many, many areas of performance in this release of our software, which is something we don't always talk about. Part of that is logistical: if I tell you that we improved cluster startup and deployment by 20% (and we did), maybe you'll take my word for it. And perhaps you'll believe me that our new <a href="http://blogs.sun.com/memrep">in-memory failover</a> is much, much faster than HADB. <a href="http://weblogs.java.net">Jeanfrancois</a> has continually blogged about the many performance enhancements we've made to Grizzly. We've even published <a href="http://wstest.dev.java.net/">WStest</a>, a web services benchmark to demonstrate the improvements we've made in our web services stack.
<br><br>
But since those aren't industry-standard, peer-reviewed benchmarks, how much of them will you believe? And frankly, how much will you believe an industry-standard, peer-reviewed benchmark? Really, the only way to tell is to
<a href="http://www.sun.com/software/products/appsrvr/index.xml">
download glassfish V2</a> yourself and run some rigorous tests on it. And see for yourself all the improvements we've made.
<br><br>
Of course, we've already started work on glassfish V3, where we'll be targeting even more performance features, including very rapid startup, particularly for web container developers.]]>

</content>
</entry>
<entry>
<title>SJSAS 9.1 (Glassfish V2) posts new SPECjAppServer 2004 result</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/07/sjsas_91_glassf_1.html" />
<modified>2008-06-24T19:17:03Z</modified>
<issued>2007-07-10T15:27:47Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.7824</id>
<created>2007-07-10T15:27:47Z</created>
<summary type="text/plain">Glassfish V1 was a price performance leader with good enough performance. Good enough is no longer enough.</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[Today, Sun officially announced SPECjAppServer 2004 scores on our Sun Java System
Application Server 9.1, which (as you no doubt know) is the productized
version of the open-source Glassfish V2 project. We've previously submitted
results for SJSAS 9.0 (aka Glassfish V1), which at the time we were quite
proud of: they were the only SPECjAppServer scores based on an open-source
application server, and that gave us a <a href="http://weblogs.java.net/blog/sdo/archive/2006/12/index.html">quite good price/performance story.</a>
Considering where we started, I was happy to conclude that those scores were
"good enough."
<br><br>
"Good enough" is no longer good enough. Today, we posted the highest ever
score for SPECjAppServer 2004 on a single Sun Fire T2000 application server:
883.66 JOPS@Standard [1]. The Sun Fire T2000 in this case has a 1.4ghz CPU; the
application also uses a Sun Fire T2000 running at 1.0ghz for its database tier.
This result is 10% higher
than WebLogic's score of 801.70 JOPS@Standard [2] on the same appserver machine.
In addition, this result is almost 70% higher than our previous score of 521.42
JOPS@Standard on a Sun Fire T2000 [3], although that Sun Fire T2000 was running at only 1.2ghz. So
that doesn't mean that we are 70% faster than we were, but we are quite
substantially faster and are quite pleased to have the highest ever score
on the Sun Fire T2000.
<br><br>
This result is personally gratifying to me in many ways, and I am proud
of it (and proud of the work by the appserver engineers that it represents)
on many, many levels. But it is just a benchmark, so let me touch on two
things that means.
<br><br>
First, vendors and their marketing departments love to play leap-frog games
with benchmarks.
My favorite example of this:
some time ago, BEA posted a score of 615.64 JOPS@Standard [4] on the 1.2ghz T2000,
only to be outdone a few months later by IBM WebSphere's score of 616.22
JOPS@Standard [5] on the same system. It's good marketing press, but at some
point those sorts of differences become slightly ridiculous to end users.
<br><br>
So yes, at some point it's conceivable that someone will post a higher
score on this machine than we have; it's conceivable that I'll be back touting
some improvements on our score (because my protestations about benchmarks
aside, I'm not above playing the game either). But don't let any of that keep
you from the point: this is a result that fundamentally changes the nature
of that game.
We used to be content with having a good result in terms of price/performance
and watching IBM, Oracle, and
BEA leap-frog among themselves in terms of raw performance. Now, we're
the raw performance leader. There will be jockeying for position in the
future, but we've changed forever the set of contenders.  [We're also still
quite interested in being price/performance leaders, by the way, which is
why we also <a href="http://blogs.sun.com/tomdaly/entry/sun_pushing_price_performance_curve">published a score this week</a> using the free, open-source Postgres
database.]
<br><br>
Second, remember that this is just a benchmark. Will you see similar results
on your application? It depends. SPECjAppServer 2004 doesn't use EJB 3.0, JPA,
WebServices, JSF, or any of a host of Java EE technologies (and frankly,
I'm pretty happy with our performance in most of those areas; see, for example
<a href="http://java.sun.com/developer/technicalArticles/WebServices/high_performance/">this article</a> or <a href="http://weblogs.java.net/blog/kohsuke/archive/2007/02/jaxws_ri_21_ben.html">this one</a> on our WebServices performance). On the other
hand, its performance is significantly affected by improvements we made to
read-only EJBs, remote
EJB invocation, and co-located JMS consumers and producers. So some of the
improvements we've made may be in areas your application doesn't even use.
[That's another reason I was happy with our previous scores: they established
us as a viable appserver vendor, and I knew that customers who benchmarked
their own applications would likely see better relative performance than
that displayed by SPECjAppServer.]
<br><br>
Don't get me wrong: we have also made substantial performance improvements
across
the board: in the servlet connector and container, in JSP processing,
in the local EJB container,
in connection pooling, in CMP 2.1, and so on. This is really an important
performance release for us. But as I always have said: the
only realistic benchmark for your
environment is your application. So go grab <a href="https://glassfish.dev.java.net/public/alldownloads.html#Promoted_binary_builds">a recent build of glassfish V2</a>,
and see for yourself.<br><br>
<p>Now, as always, some disclosures:
SPEC and the benchmark name SPECjAppServer 2004 are registered trademarks of the Standard Performance Evaluation Corporation.
Competitive benchmark results stated above reflect results published
on www.spec.org as of 07/10/07. The comparison presented is based on
application servers run on the Sun Fire T2000 1.2 ghz and 1.4ghz servers.
For the latest SPECjAppServer 2004 benchmark results, visit http://www.spec.org/. Referenced scores:<br>
[1] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 883.66 SPECjAppServer2004 JOPS@Standard<br>
[2] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 801.70 SPECjAppServer2004 JOPS@Standard<br>
[3] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database ; 521.42 SPECjAppServer2004 JOPS@Standard<br>
[4] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire V490 (4 chips, 8 cores, 2 cores/chip) database; 615.64 SPECjAppServer2004 JOPS@Standard<br>
[5] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire X4200 (2 chips, 4 cores, 2 cores/chip) database; 616.22 SPECjAppServer2004 JOPS@Standard<br>]]>

</content>
</entry>
<entry>
<title>Switching tracks</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/07/switching_track.html" />
<modified>2008-06-24T19:17:03Z</modified>
<issued>2007-07-09T22:06:46Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.7817</id>
<created>2007-07-09T22:06:46Z</created>
<summary type="text/plain">Java has two switch statements -- should you actually care?</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[One of those lesser-known features of Java is that it contains two
different bytecodes for switch statements: a generic switch statement,
and an (allegedly more optimal) table-driven switch statement. The
compiler will automatically generate one or the other of these
statements depending on the values in the switch statement: the
table-driven statement is used when the switch values are close to
being sequential (possibly with a few gaps), where the generic
statement is used in all other cases. It's the sort of thing that
intrigues performance-oriented developers: is the table-driven
statement really more optimal? Is it worth coercing the variable
involved in a switch statement so that the compiler can generate a
table-driven statement? Is there ever a case in a real-world program
where this would even matter? Interesting questions, but since I
assumed the answer to the last one was "no", I never really thought
about the first few.<br>
<br>
Now, however, I'm looking at some profiles of Glassfish V2, and I find
that when running a particular application, we're spending a full 1% of
our time in this method:<br>
<pre>protected java.util.logging.Level convertLevel(int level) {<br>    int index = level / 100;<br>    switch (index) {<br>        case 3: return Level.FINEST;<br>        case 4: return Level.FINER;<br>        case 5: return Level.FINE;<br>        case 7: return Level.CONFIG;<br>        case 8: return Level.INFO;<br>        case 9: return Level.WARNING;<br>        case 10: return Level.SEVERE;<br>        default: return Level.FINER;<br>    }<br>}<br></pre>
Seems like a pretty simple method to be spending so much time in (and
let's face it, sampling profilers may overstate their time for a method
like this). So I dug in a little further. The level value passed to
this method is always exactly divisible by 100: it's not the case that
level can be 300, 305, and 310. So there is a one-to-one correspondence
between the integers passed to the method and the Level object
returned. So I was rather impressed that the original author of this
code had known enough arcane Java trivia to know that he could coerce
the argument to get the table-driven switch statement.<br>
<br>
Alas, if only he'd taken the next step to see if the performance
difference was worthwhile. It turns out that it wasn't: removing the
division from this method and recasting the switch statement to values
of 300, 400, and so on eliminated all the time the profiler attributed
to this method and resulted in a .5% improvement in the way the
application ran. I also did some quick micro-benchmarking of the method
and discovered that if I didn't need to coerce the argument into the
switch statement (that is, if I passed in values of 3, 4, 5, etc. to
begin with), the performance of the method was essentially the same, but
adding the division statement to coerce the argument slowed down
execution of the method quite significantly.<br>
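<br>
For reference, here is a sketch of the two shapes side by side. The first method is the one shown above; the second is a reconstruction of the recast version described in the text (an assumption about its exact form, not the actual Glassfish change). The class name is made up for the example.

```java
import java.util.logging.Level;

// Illustrative only: the two ways of writing the level conversion.
class SwitchVariants {
    // Original shape: divide first so the case labels are nearly sequential.
    static Level convertWithDivision(int level) {
        int index = level / 100;
        switch (index) {
            case 3: return Level.FINEST;
            case 4: return Level.FINER;
            case 5: return Level.FINE;
            case 7: return Level.CONFIG;
            case 8: return Level.INFO;
            case 9: return Level.WARNING;
            case 10: return Level.SEVERE;
            default: return Level.FINER;
        }
    }

    // Recast shape: switch directly on the raw value, with no division.
    static Level convertDirect(int level) {
        switch (level) {
            case 300: return Level.FINEST;
            case 400: return Level.FINER;
            case 500: return Level.FINE;
            case 700: return Level.CONFIG;
            case 800: return Level.INFO;
            case 900: return Level.WARNING;
            case 1000: return Level.SEVERE;
            default: return Level.FINER;
        }
    }

    public static void main(String[] args) {
        // The two versions are only interchangeable for exact multiples of 100.
        for (int level = 0; level <= 1200; level += 100) {
            if (convertWithDivision(level) != convertDirect(level)) {
                throw new AssertionError("mismatch at " + level);
            }
        }
        System.out.println("Both versions agree for every multiple of 100");
    }
}
```

Running javap -c on the compiled class should show which switch bytecode the compiler chose for each method. Note that the two methods differ for values that are not multiples of 100 (305, for instance), so the rewrite is only safe because the callers never pass such values.<br>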
<br>
At .5% of performance, I'm not sure that this is the real-world example
of where this would ever matter -- though when you provide a platform
for other people's applications, you worry about your operations being
as streamlined as possible. But it is another example of why you should
test your code before making assumptions about how it will perform, and
particularly before writing code to work around a potential performance
issue.<br>]]>

</content>
</entry>
<entry>
<title>Dynamically sizing threadpools</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/06/dynamically_siz.html" />
<modified>2008-06-24T19:17:03Z</modified>
<issued>2007-06-07T18:15:51Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.7588</id>
<created>2007-06-07T18:15:51Z</created>
<summary type="text/plain">Thread pools can typically be dynamically resized, but is that a feature you should take advantage of? In a word -- no.</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[Almost every thread pool implementation takes great pains to make sure
that it can dynamically resize the number of threads it utilizes: you
specify the minimum number of threads you want, the maximum number, and
the thread pool in its wisdom will automatically configure itself to
have the optimal number of threads for your workload. At least, that's
the theory...<br>
<br>
But what about in practice? I'd argue that its utility is very limited,
and that in many cases, a dynamically-resizing threadpool will actually
harm the performance of your system.<br>
<br>
First, a quick review of why we have threadpools. From a performance
perspective, the most important task of a threadpool is to throttle the
number of simultaneous tasks running on your system. I know that you
may think that the purpose of a threadpool is to allow you to
conveniently run multiple things at once. It does that, but more
importantly, it prevents you from running too many things at once. If
you need to run 100 CPU-bound tasks on a machine with 4 CPUs, you will
get optimal throughput if you run only 4 tasks at a time: each task
fully utilizes the CPU while it is running. Since you can't run more
than 4 tasks at once, you won't get any better throughput by having
more threads -- in fact, if you add more threads to the saturated
system, your throughput will go down: the threads will compete with
each other for CPU and other system resources, and the operating system
will spend more time than necessary managing the competing threads.<br>
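<br>
As a minimal sketch of that throttling effect, the code below pushes 100 CPU-bound tasks through a fixed-size pool and records the highest number that ever ran at once. The class name and the busy loop standing in for real work are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: shows that a fixed pool caps how many tasks run at once.
class ThrottledPool {
    // Runs taskCount CPU-bound tasks on a pool of poolSize threads and returns
    // the highest number of tasks observed running at the same time.
    static int maxConcurrency(int poolSize, int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    long sum = 0;
                    for (int j = 0; j < 1_000_000; j++) {
                        sum += j % 7; // stand-in for real CPU-bound work
                    }
                    running.decrementAndGet();
                    return sum;
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every task to finish
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("Peak concurrency: " + maxConcurrency(cpus, 100)
                + " on " + cpus + " CPUs");
    }
}
```

However many tasks are queued, the peak never exceeds the pool size; the other tasks simply wait their turn instead of competing for the CPUs.<br>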
<br>
In the real world, of course, tasks are never 100% CPU-bound, so you'll
usually want more threads than CPUs to get optimal use of your system.
How many more is a function of your workload:&nbsp; how much time it
waits for external resources like a database, and so on. But there will
be an optimal number, usually much lower than the number of
simultaneous tasks you can handle (particularly if those tasks
represent jobs coming in from remote users -- e.g. a web or application
server handling thousands of connections). The determining rule is
this: if you have more tasks to perform AND you have idle CPU time,
then it makes sense to add more threads to the pool. If you have more
tasks to perform but no idle CPU time, then it is counter-productive to
add threads to the pool. And that's my problem with dynamically
resizing threadpools: if they choose to add threads because there are
tasks waiting (even though there is no available CPU time), they will
hurt your performance rather than help it.<br>
<br>
Conceivably, you could use some native code to figure out the idle CPU
time on your system and have a threadpool that takes that information
into account. That would be better, but even that is insufficient. Say
you have an application server accessing a remote database using JPA.
Now if the database becomes a bottleneck, you'll have idle CPU time on
your application server, and it will have tasks that are waiting. But
adding threads to run those tasks will again make things worse: it will
increase the work needed to be done by the already-saturated database,
and your overall throughput will suffer. In the final analysis, you are
the only one that will have all the necessary information to know if it
is productive to increase the size of your thread pool.<br>
<br>
So you are responsible for setting the maximum size of the threadpool
to a reasonable value, so that the system will never attempt to run too
many threads at once. Given you've done that, is there a point in
having a minimum number of threads? The claim is that there is, because
it can save on system resources. But I would argue that the impact of
that is really minimal. Each thread has a stack and so consumes a
certain amount of memory. But if the thread is idle and the machine
doesn't have enough physical memory to handle everything on the system,
that idle memory will simply be paged out to virtual memory. Even if
the thread exits, the memory it used for its stack still belongs to the
JVM process -- the JVM might reuse that memory for something else, but
in general, the memory cannot be returned to the operating system for
use by other processes. So the memory issue doesn't really have much
impact. Depending on the application, it's conceivable that fewer idle
threads may have a small impact because when a thread is reused, it
might happen to have some important data in the CPU cache (whereas an
idle thread selected to run a task won't have any data in the CPU
cache), but the effects of that in the real world are pretty much
non-existent. So it doesn't hurt to have a minimum number of threads,
but you get no real advantage from that either.<br>
<br>
One area that can be very subtle in this regard is the
ThreadPoolExecutor, which can be configured to have three values: a
minimum, a core value, and an absolute maximum. In general, threads are
added when tasks are waiting until the system runs the desired core
value of threads. Then everything chugs along nicely, even though a
certain number of tasks may be waiting in the queue. Now say that the
system can't keep up with the task queue: the queue length grows
beyond some defined value. In response to this, the executor will start
adding threads (up to the absolute maximum). But if the system is
CPU-bound, or if the system is causing a bottleneck on an external
resource, adding those threads is exactly the wrong thing to do. And
because this happens only under circumstances such as an increased
load, it might be something that you fail to catch in normal testing:
during normal testing, you'll usually run with the core number of
threads and may not even notice that you've misconfigured the maximum
number of threads to a value the system cannot handle. The converse of
this argument is that the thread pool executor can add new threads when
a burst of traffic comes, and as long as there are resources available
to execute those threads, the executor can handle the additional tasks
(and then, once the burst is over, the extra threads can exit and
reduce system resource usage). But given the minimal-at-best effect
that has on system resources, handling a burst like that doesn't make a
lot of sense to me, particularly given the potential for increasing
load on the system at exactly the wrong time.<br>
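A minimal sketch of a statically sized pool follows. The class name StaticPool, the helper create, and the size of 20 are assumptions made only for this illustration; the point is that when the core and maximum sizes are equal, the executor has no room to add threads under load.<br>

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StaticPool {
    // Hypothetical helper: builds a pool whose core and maximum sizes are equal,
    // so the executor never adds threads in response to a growing queue.
    static ThreadPoolExecutor create(int size) {
        return new ThreadPoolExecutor(
                size, size,                 // core == maximum: a fixed-size pool
                0L, TimeUnit.MILLISECONDS,  // keep-alive is irrelevant: no extra threads exist
                new LinkedBlockingQueue<Runnable>()); // unbounded queue: tasks wait in line
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = create(20); // 20 is an assumed size, not a recommendation
        for (int i = 0; i < 100; i++) {
            pool.execute(() -> { /* stand-in for real work */ });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // The largest pool size can never exceed the configured size.
        System.out.println("largest pool size: " + pool.getLargestPoolSize());
    }
}
```

This is roughly what Executors.newFixedThreadPool gives you; spelling out the constructor just makes the two equal sizes visible.<br>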
<br>
All of that is why I always choose to ignore dynamically sizing
threadpools, and just configure all my pools with a static size.<br>]]>

</content>
</entry>
<entry>
<title>How to test container scalability</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/05/how_to_test_con.html" />
<modified>2008-06-24T19:17:03Z</modified>
<issued>2007-05-02T19:38:25Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.7211</id>
<created>2007-05-02T19:38:25Z</created>
<summary type="text/plain">NIO can easily scale to thousands of users, but how do you accurately test if you&apos;re measuring 16,000 users?</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[Recently, I've been asked a lot about Covalent Technologies' report that
<a href="http://blog.covalent.net/roller/covalent/entry/20070308">Tomcat
6 can scale to 16,000 users</a> and what that means for glassfish.
Since glassfish can easily scale to 16,000 users as well (as <a
 href="http://people.apache.org/%7Efhanik/tomcat/glassfish-reconf.txt">Covalent
found out</a> once they <a
href="http://weblogs.java.net/blog/jfarcand/archive/2007/01/configuring_gri.html">properly
configured glassfish</a>), my reply has usually been accompanied by a
shrug: we've known for quite some time that NIO scales well. <br>
<br>
But what does it mean to scale to N users, where N is large?
The answer is highly dependent on your benchmark, and in particular on
the think time that your benchmark uses. It's very easy to scale to
16,000 users if they each make a request every 90 seconds: that's on
the order of 180 requests/second. On the other hand, if there's no
think time in the equation, then continually handling 16,000 requests
is quite difficult, particularly on small machines. Closely related to
this is the response time of your requests: handling 16,000 requests
with an average response time of 10 seconds isn't particularly helpful
to your end users. But the most difficult aspect in scaling to 16,000
users is finding sufficient client horsepower to make sure that the
clients themselves aren't the bottleneck. Otherwise, any conclusions
you draw about the throughput or performance of the server are simply
wrong: the conclusions apply to the performance of the clients. So in
this blog, I'll explore some of the considerations you need to
examine in order to benchmark a large system properly.<br>
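<br>
The think-time arithmetic above can be sketched in a few lines of Java. The class and method names are hypothetical, and the formula assumes a closed system in which each user spends one response time plus one think time per request.<br>

```java
public class OfferedLoad {
    // Steady-state request rate for a closed system of `users` clients, each
    // cycling through one response time plus one think time per request.
    static double requestsPerSecond(int users, double thinkSeconds, double responseSeconds) {
        return users / (thinkSeconds + responseSeconds);
    }

    public static void main(String[] args) {
        // 16,000 users with a 90 second think time; response time assumed negligible.
        System.out.printf("%.1f requests/second%n", requestsPerSecond(16000, 90.0, 0.0));
        // Same users with a 2 second think time and an assumed 0.25 second response time.
        System.out.printf("%.1f requests/second%n", requestsPerSecond(16000, 2.0, 0.25));
    }
}
```

The first call is the "on the order of 180 requests/second" case; shrinking the think time is what turns the same user count into a heavy load.<br>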
<br>
I've written before about <a
 href="http://weblogs.java.net/blog/sdo/archive/2007/03/index.html">why
the Apache Benchmark can't handle this situation</a> (surprisingly
enough, I'd been ranting against ab long before Covalent published
their benchmark; it's just fortuitous timing that they brought ab's
failings to light at the same time I was fed up with questions about ab
benchmarks from my colleagues). So for the tests I'll describe here, I
used Faban's new Common Driver. I've also previously written about how <a
 href="http://weblogs.java.net/blog/sdo/archive/2007/04/index.html">Faban
is a great, configurable benchmarking</a> framework, but the new common
driver is a simple, command-line program that can benchmark requests of
a single URL. I ran the tests on a partitioned SunFire T2000. This
particular machine has 24 logical CPUs (6 cores with 4 hardware
threads each, but for our purposes, simply 24 CPUs), which I partitioned
into a server set of 4 CPUs and a client set of 20 CPUs. Yes, it takes
20 CPUs to drive some of the tests I ran, and so for consistency, I
kept that configuration for all of them. But it's a crucial point: if
the client is a bottleneck, you're measuring the client performance,
not the server performance. Using a set of processors on a single
machine allowed me to run the tests bypassing the network, which also
removes a potential bottleneck from measuring the server performance.
Given that there are only 4 CPUs for the server, I configured all
containers to use 2 acceptor threads and 20 worker threads, and
otherwise followed Sun's and Covalent's blog entries on configuring the
containers.<br>
<br>
I started with a simple test:<br>
<pre>java -d64 -classpath $JAVA_HOME/lib/tools.jar:fabancommon.jar:fabandriver.jar \
   -Xmx3500m -Xms3500m com.sun.faban.driver.cd -c 30 http://localhost/tomcat.gif<br></pre>
<p>This runs 30 separate clients (each in its own thread), each of
which continually requests tomcat.gif with no think time. You'll notice
we're using a 64-bit JVM for the test; eventually we'll be creating
16000 threads, which will require more than 4GB of address space. So to
make it easier for me, I used that JVM for all my tests. Have I
mentioned that driving a big client load requires a lot of resources so
that the client doesn't become the bottleneck?<br>
</p>
<p>The common driver reports three pieces of information: the number of
requests served per second, the average response time per request, and
the 90th percentile for requests: 90% of requests were served with that
particular response time or less. It will also report the number of
errors observed and some error conditions I'll discuss a little later.
I varied this test for different numbers of clients to see these
results:<br>
</p>
<pre>
# Users        Glassfish       Tomcat
  30          7552.9/0.004    7614.6/0.003
 100         10004.6/0.009    7680.4/0.013
1000         12434.7/0.079    6880.3/0.145
5000          8942.7/0.534    7589.0/0.654
</pre>
The results here are operations per second and the average response
time. I'd assume that I've misconfigured Tomcat's file cache here, but
the point isn't to make a comparison between the products' absolute
performance; rather it is to explore issues around scalable
benchmarking. For static content, we get decent scaling, though at some
point there are enough requests that the throughput of the server
suffers: just what we would expect. So what about a dynamic test? Here
are some numbers from surfing to http://localhost/Ping/PingServlet -- which
is just a simple servlet that prints out 4 html strings and returns.<br>
<pre>
# Users        Glassfish       Tomcat
   30          5033.3/0.005   7154.0/0.004
  100          6359.5/0.015   7459.5/0.013
 1000          7411.2/0.134   6483.2/0.154
 5000          6060.1/0.818   6976.5/0.712
16000          6144.3/2.544   5263.0/2.375
</pre>
Here the numbers are fairly close. At the low end, glassfish pays a
penalty for being a full Java EE container, which requires it to do
some additional work for the simple servlet. [Though the fact that the
glassfish ops/sec increases so much with more users is an indication
that there's probably some bottleneck we could fix in the code at 30
users; hmm...a performance engineer's work is never done.] That result
at 5000 users? I'll discuss it later, but it's an anomaly. But first:
what about 16,000 connections? In addition to producing low throughput,
the tomcat run also reported: <br>
<pre>ERROR: Little's law failed verification: 16000 users requested; 13092.3455
users simulated.<br></pre>
In essence: almost 20% of the connections weren't serviced as expected
(glassfish reported a similar error).
I could repeat the test, and sometimes it would pass; sometimes it
would fail. But I'm clearly at the limit here of the hardware and
software. In this scenario, most of the errors are timeout errors on
connection: the server is too saturated in this test to accept new
connections. Note that that wouldn't happen with something like ab,
because ab's single-threaded nature inherently introduces an arbitrary
(and unmeasured) amount of think time into the equation. The amount of
think time is crucial, in that it drastically reduces the load on the
server; and an arbitrary amount of think time is fatal, because we no longer know
what we're measuring.<br>
<br>
To test this scenario properly, we introduce a <em>deterministic</em> think time
into the driver by including a <b>-W 2000</b> parameter, which says each
client should have a 2 second (2000 ms) think time between requests. Now for
16,000 users, each server gave me these results:<br>
<pre>
                    Glassfish       Tomcat
ops/second           6988.9          6615.3
Avg. resp time        0.242           0.358
Max resp time         1.519           3.693
90% resp time           0.6            0.75
</pre>
Now both containers are handling the 16000 users, and the data we get
regarding throughput and response time is valid.<br>
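<br>
As a rough cross-check, the textbook form of Little's law (users = throughput x (response time + think time)) can be applied to the numbers in the table. This is only a sketch with hypothetical names; the exact check and tolerance that Faban applies are not described here, so treat it as an approximation of the idea rather than the driver's implementation.<br>

```java
public class LittlesLawCheck {
    // Little's law for a closed system:
    // users = throughput * (average response time + think time).
    static double simulatedUsers(double opsPerSecond, double avgResponseSeconds, double thinkSeconds) {
        return opsPerSecond * (avgResponseSeconds + thinkSeconds);
    }

    public static void main(String[] args) {
        double think = 2.0; // the -W 2000 think time, in seconds
        // Throughput and average response time taken from the table above.
        System.out.printf("Glassfish: %.0f users%n", simulatedUsers(6988.9, 0.242, think));
        System.out.printf("Tomcat:    %.0f users%n", simulatedUsers(6615.3, 0.358, think));
    }
}
```

Both results land within about 3% of the 16,000 users requested, in contrast to the roughly 13,092 users simulated in the run with no think time.<br>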
<br>
Back to that result at 5000 users. The other interesting output from
the Faban common driver for the glassfish result was:<br>
<pre>ERROR: Think time deviation too high; request 0; actual is 1.0<br></pre>
<p>Or in the case of tomcat, the actual was 6.0 (accounting for their
better score) -- but the point is, although we didn't want think time
on the client, the client had some bottleneck that didn't allow it to keep up and
hence the benchmark result suffered. In effect, we ended up
benchmarking the client again, having yet again introduced an arbitrary,
non-deterministic think time. So even for 5000 users, we need to use
some think time to get an accurate assessment of the server behavior.
And so here are the results at 5000 users with a 500 millisecond think
time:<br>
</p>
<pre>
                     Glassfish       Tomcat
ops/sec               7607.25         7224.1
Avg. resp time          0.149          0.182
Max resp time           0.737          2.626
90% resp time            0.25           0.25
</pre>
So does any of this mean that glassfish is better than tomcat? For
some applications, probably. For others, probably not. The real point
to take away from this is an understanding of how important it is to
understand what you're measuring when you measure performance. The
tests I've run are much too simple to draw any conclusions from: the
only realistic benchmark is your own application. But hopefully, now
you have a better understanding of how to approach large-scale testing
of your own application.<br>
<br>
The Common Driver for Faban is brand new code, so it hasn't yet been integrated into Faban's build schedule -- in fact, there is an issue with how it handles POST requests, which is what is delaying its integration. For now, you can download the <a href="http://home.nyc.rr.com/twopks/faban/fabancommon.jar">fabancommon.jar</a> and <a href="http://home.nyc.rr.com/twopks/faban/fabandriver.jar">fabandriver.jar</a> files I used for testing. If you find any problems with it (other than trying a POST request), be sure to let me know!]]>

</content>
</entry>
<entry>
<title>Simple Benchmarking with Faban</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/04/simple_benchmar.html" />
<modified>2008-06-24T19:17:03Z</modified>
<issued>2007-04-19T18:52:21Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.7097</id>
<created>2007-04-19T18:52:21Z</created>
<summary type="text/plain">Faban is a quite powerful benchmarking framework, but it can be used as a simple alternative to the flawed ab program.</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[A few weeks ago, I wrote about the <a
 href="http://weblogs.java.net/blog/sdo/archive/2007/03/ab_considered_h.html">shortcomings
of ab</a>, the Apache Benchmark program. Its major shortcoming is that
it is single-threaded, making it very difficult to measure accurately a
multi-threaded application or web server. At the time, I promised to
blog soon about an alternative open-source load generator called <a
 href="http://faban.sunsource.net/">Faban</a>. Real life intervened,
but here is the promised blog about using Faban as an alternative to ab.<br>
<br>
Faban is actually a quite sophisticated benchmarking framework. It is
designed to automate running benchmarks from a set of clients, which is
of course quite useful when trying to generate load on an application
server. As such, it has a very nice GUI where you can define and submit
jobs, co-ordinate driver hosts, and manage all the tasks necessary for
large-scale benchmarking. Faban also allows you to write your own
load-generating driver, so that the interaction with your application
can be quite sophisticated: you can look through the returned HTML and
figure out what links to follow next; you can define interactions via
IIOP or other protocols; you can have a variety of interaction models
based on think time or cycle times; and so on. For HTTP, its timings
are as accurate as possible because it interposes at the
lowest-available socket level in the JVM and takes its timing
measurements there.<br>
<br>
However, Faban also ships with a standard HTTP driver that allows
moderately complex operations (so you don't need to write a driver) and
a set of classes that allow you to run clients from simple scripts.
Because I'm a simple, command-line oriented kind-of-guy, that's the way
I typically use Faban, and that's the easiest way to use it as a
replacement for ab. So that's what I'll discuss here. But I urge you to
check out Faban's full set of features; it's really a very powerful
tool. And it's interesting to know that Faban will form the basis of
the driver used by SPEC in their next version of their appserver
benchmark.<br>
<br>
So, for a better ab-like test, what do we need? First, download the
latest nightly build of the faban client jars. Then you will need to
write two things: a run.xml file that defines how you want to run the
benchmark, and a quick script to start the benchmark.<br>
<br>
The run.xml file defines how a benchmark should be run. The faban
documentation lists a lot of options for that file, but the simplest
case for our purposes is this:<br>
<pre>&lt;?xml version="1.0" encoding="UTF-8"?&gt;<br>  &lt;runConfig&gt;<br>    &lt;runControl unit="time"&gt;<br>      &lt;rampUp&gt;300&lt;/rampUp&gt;     --&gt; defines a 300 second rampup time<br>      &lt;steadyState&gt;300&lt;/steadyState&gt;  --&gt; defines a 300 second measurement cycle<br>      &lt;rampDown&gt;120&lt;/rampDown&gt; --&gt; defines a 120 second rampdown time<br>    &lt;/runControl&gt;<br>    &lt;benchmarkDefinition&gt;<br>      &lt;name&gt;my_test&lt;/name&gt;   --&gt; used for the run sequence file<br>      &lt;metric&gt;ops/sec&lt;/metric&gt;<br>    &lt;/benchmarkDefinition&gt;<br>    &lt;outputDir&gt;<br>      /path/to/an/existing/directory   --&gt; results will go in this directory<br>    &lt;/outputDir&gt;<br>    &lt;driverConfig name="http_driver1"&gt;<br>      &lt;threads&gt;64&lt;/threads&gt;    --&gt; number of clients to run, each in its own thread<br>      &lt;requestLagTime&gt;         --&gt; defines think time for each client<br>        &lt;uniform&gt;<br>          &lt;cycleType&gt;thinktime&lt;/cycleType&gt;<br>          &lt;cycleMin&gt;0&lt;/cycleMin&gt;<br>          &lt;cycleMax&gt;0&lt;/cycleMax&gt;  --&gt; in this case, 0 think time between requests<br>        &lt;/uniform&gt;<br>      &lt;/requestLagTime&gt;<br>      &lt;operation&gt;<br>        &lt;name&gt;getTest&lt;/name&gt;   --&gt; you can define multiple operations; give each a unique name<br>        &lt;url&gt;http://host:port&lt;/url&gt;<br>        &lt;get&gt;&lt;![CDATA[/index.html]]&gt;&lt;/get&gt;   --&gt; could be post, and an arbitrary URL<br>        &lt;max90th&gt;.1&lt;/max90th&gt;        --&gt; 90% of responses must be received in this time period<br>      &lt;/operation&gt;<br>      &lt;operationmix&gt;               --&gt; Defines how many of each multiple operation is executed<br>        &lt;name&gt;getTest&lt;/name&gt;<br>        &lt;r&gt;1&lt;/r&gt;                   --&gt; 100% of operations are getTest<br>      &lt;/operationmix&gt;<br>    
&lt;/driverConfig&gt;<br>  &lt;/runConfig&gt;<br></pre>
So this file will execute calls to http://host:port/index.html from 64
threads with 0 think time between requests. You can get an idea from
this how faban allows you to build more complex tests of your
appserver. Next, you'll need a script to run the benchmark:<br>
<pre>#!/bin/sh<br>CLASSPATH=/path_to_faban/fabanagents.jar:/path_to_faban/fabancommon.jar:/path_to_faban/fabandriver.jar:$JAVA_HOME/lib/tools.jar<br>export CLASSPATH<br>java -Djava.security.policy=policy_file com.sun.faban.common.RegistryImpl &amp; pid=$!<br>sleep 3<br>java $JVM_OPTIONS -Dbenchmark.config=run.xml com.sun.faban.driver.core.MasterImpl<br>kill $pid<br></pre>
When you run this, a directory containing the results of the run is
created (based on the outputDir stanza); the directories are numbered
consecutively (the last value of that number is in $HOME/my_test.seq --
the run sequence file). There's a lot of data there, but the key file
is summary.xml, which will contain the metric and whether the test
passed or failed according to the parameters (desired cycle times, 90th
percentile times, and so on) defined in your run.xml. Those lines look
like this:<br>
<pre>    &lt;metric unit="ops/sec"&gt;9061.7&lt;/metric&gt;<br>    &lt;passed&gt;true&lt;/passed&gt;<br></pre>
So my test got 9061 requests per second. There's a nice XSLT
transformation available with faban if you want to see the entire set
of data.<br>
<br>
Because it's a relatively new project, the power of Faban is somewhat
exposed in the complexity to run it -- at least for the time being.
There's no reason why a simple Java program couldn't take command-line
arguments and produce the necessary scripts to run Faban in this simple
mode. If real life doesn't intrude, I'll work on that in the next few
weeks. But don't wait for me: join the Faban project and contribute
yourself!<br>
]]>

</content>
</entry>
<entry>
<title>ab considered harmful</title>
<link rel="alternate" type="text/html" href="http://weblogs.java.net/blog/sdo/archive/2007/03/ab_considered_h.html" />
<modified>2008-06-24T19:17:03Z</modified>
<issued>2007-03-23T23:09:27Z</issued>
<id>tag:weblogs.java.net,2007:/blog/sdo/289.6903</id>
<created>2007-03-23T23:09:27Z</created>
<summary type="text/plain">ab is popular as a tool to measure appserver performance, but it is clearly the wrong tool for the job.</summary>
<author>
<name>sdo</name>

<email>Scott.Oaks@Sun.COM</email>
</author>
<dc:subject>Performance</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://weblogs.java.net/blog/sdo/">
<![CDATA[<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta content="text/html; charset=ISO-8859-1"
 http-equiv="content-type">
  <title>blpg</title>
</head>
<body>
For the fifth time this year, I've been contacted by a distraught user
claiming that glassfish doesn't scale or run well based on results seen
from ab (the <a
 href="http://httpd.apache.org/docs/2.0/programs/ab.html">Apache
Benchmark</a>). And so again, I've had to explain why ab is a terrible
tool to use to measure the performance of your application (or web)
server.<br>
<br>
To be fair, glassfish does have some out-of-the-box settings that make
its benchmark test results less than ideal. Jeanfrancois has <a
 href="http://weblogs.java.net/blog/jfarcand/archive/2007/03/configuring_gri_2.html">this</a>
excellent blog that describes the basic settings you need to change
before even beginning to do serious performance analysis. I'm hopeful
that we'll have better profiles by the time FCS rolls around so that a
performance-based profile is easily available to end users. [There are
some conflicts between optimal settings for developers and production,
which is one cause of our problem here, not to mention some historical
baggage we have for backward-compatibility. But that's a topic for
another day.]<br>
<br>
But once you have a reasonably configured appserver, ab is still not
the best tool to use to measure your performance. The biggest problem
is that ab is a single-threaded process, and you're typically
interested in measuring the performance of your multi-CPU machine
running the multi-threaded appserver. You can (I hope) see the inherent
problem: you have 1 CPU of client-side resources and, say, 4 CPUs of
server-side resources. Which side will become the bottleneck first? The
client side -- meaning all you've accomplished is measuring the
performance of ab itself.<br>
<br>
This all depends on what you're measuring, of course. Lately, using ab
to measure the retrieval of a single static image seems to be all the
rage, and this is the worst possible test. Let's say that it takes the
appserver 50% longer to process the request for http://host/foo.gif
than it takes for ab to send the request and parse the response to make
sure it came back correctly (and drain the socket of all the data).
Even that is unrealistic, but what it means is that you'll end up using
1.5 CPUs on your appserver by the time your client gets saturated.
Nothing you do to the appserver will make this better; the bottleneck
is ab.<br>
<br>
So now you're thinking: what if I have multiple CPUs on my client and I
use that -c option to ab: the option that's supposed to send
"concurrent" requests. Won't that scale? Unfortunately not, because the
"concurrent" requests are still processed sequentially by ab. ab has
only a single thread available to it, so all it does is send multiple
requests (one after the other), read any responses that have been sent
back (still only one at a time), send any new requests, and so on. It
is still limited to utilizing at most a single CPU.<br>
<br>
And what of the timings you get out of this? The single ab thread sends
a request at time 0. Then if it has other responses to process, it will
do so. Say there are 10 more responses to process (which means draining
the socket of data, and sending the next request on the socket), and
then say ab takes 10 milliseconds for each of them. Only then will it
again look for a response to the original request. If the response to
the original request is waiting for ab, ab will report that it took 110
milliseconds for that request to be processed. But that's only because
ab itself spent 100 milliseconds handling other details; it has
erroneously charged all of that time it spends sequentially processing
data to the pending response. Client-side overhead in any
load-generating tool is a problem, but the sequential design of ab
makes the problem much worse in ab than in other load generators.<br>
<br>
Finally, what about those responses? If you run ab -c 100, there are
100 channels open to the server, and ab will report how much throughput
comes through those 100 channels. But it won't tell you anything about
fairness: 100 responses could come from one channel, or 1 response
could come from each channel, and ab will give you the same answer. In
fact, given its sequential design, an application server that responds
unfairly to requests will show better response times in ab than an
application server that responds to requests fairly. But somehow, I
don't think the actual users of the first application server will be
all too happy (well, one of them will be quite happy indeed!).<br>
<br>
Are there alternatives to ab? I'm quite happy with <a
 href="http://faban.sunsource.net/">faban</a>, an open-source
benchmarking toolkit developed by some of my colleagues. It is
multi-threaded, can access arbitrary URLs, and measures fairness among
other things. It is trickier to set up than ab, though in a future blog
I'll explore how it can be used as an ab alternative. Until then, if
someone offers you ab, just say no.<br>
</body>
</html>]]>

</content>
</entry>

</feed>