More on the simple vs. the complex
Yesterday, I wrote that I'm often asked which X is faster (for a variety of X). The answer always depends on your perspective. I answered in terms of hardware and concluded (as I always do) that it depends very much on your needs, but that a machine which appears slower in a single-threaded test will likely be faster in a multi-threaded world. You can't necessarily extrapolate results from a simple test to a complex system.
What about software? Today, I'll look at that question in terms of NIO. You're probably aware that the connection handler of glassfish is based on Grizzly, an NIO framework. Yet in recent weeks, we've read claims from the Mailinator author that traditional I/O is faster than NIO. And a recent benchmark from Jonathan Campbell shows a traditional I/O-based appserver outperforming glassfish. So what gives?
Let's look more closely at the test Jonathan Campbell ran: even though it simulates multiple clients, the driver runs only a single request at a time. Though it doesn't appear so on the surface, this is exactly an NIO issue; it has to do with how you architect servers to handle single-request streams vs. a conversational stream. A little-known fact about glassfish is that it still contains a blocking, traditional I/O-based connector which is based on the Coyote connector from Tomcat. You can enable it in glassfish by adding the -Dcom.sun.enterprise.web.connector.useCoyoteConnector=true option to your jvm-options -- but read this whole blog before you decide that using that connector is a good thing.
So I enabled this connector, got out my two-CPU Linux machine running Red Hat AS 3.0, and re-ran the benchmark Jonathan ran on glassfish and jBoss (I tried Geronimo, but when it didn't work for me, I abandoned it -- I'm sure I'd just done
something stupidly wrong in running it, but I didn't have the time to look into
it). I ran each appserver with the same JVM options, but did no other tuning.
And now that we're comparing the blocking, traditional I/O connectors, Glassfish comes out well on top (and, by comparison with Jonathan's numbers, it would easily have beaten Geronimo as well).
So does this mean that traditional I/O is faster than NIO? For this test, yes. But in general? Not necessarily. So next, I wrote up a little Faban driver that uses the same war file as the original test, but Faban runs the clients simultaneously instead of sequentially, continually pounding on the same sessions. In my Faban test, I ran 100 clients, each of which had a 50 ms think time between repeated calls to the session validation servlet of the test. This gave me these calls per second:
- Glassfish with NIO (grizzly): 8192
- Glassfish with Std IO: 3344
- jBoss: 6953
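The shape of that closed-loop driver is easy to sketch in plain Java: each simulated client issues a request, sleeps for its think time, and repeats until the measurement window closes. (The class, its names, and the stubbed request() call are mine for illustration; this is not Faban's actual API, and a real driver would make an HTTP call to the servlet where request() is.)

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Closed-loop load sketch: CLIENTS concurrent users, each with a fixed
// think time between calls, measured over a fixed window.
public class ClosedLoopDriver {
    static final int CLIENTS = 100;     // concurrent simulated clients
    static final long THINK_MS = 50;    // think time between calls
    static final long RUN_MS = 1_000;   // measurement window

    static final LongAdder calls = new LongAdder();

    // Stand-in for the HTTP call to the session-validation servlet.
    static void request() {
        calls.increment();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(CLIENTS);
        final long deadline = System.currentTimeMillis() + RUN_MS;
        for (int i = 0; i < CLIENTS; i++) {
            pool.execute(() -> {
                try {
                    while (System.currentTimeMillis() < deadline) {
                        request();
                        Thread.sleep(THINK_MS);   // think time
                    }
                } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(RUN_MS + 5_000, TimeUnit.MILLISECONDS);
        System.out.println("calls/sec = " + calls.sum() * 1000 / RUN_MS);
    }
}
```

The point of the closed loop is that all 100 clients have requests in flight at once, which is what exercises the connector's concurrency rather than its single-request latency.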
Yes, those calls per second are vastly higher than in the original benchmark -- the jRealBench driver is able to drive the CPU usage of my appserver machine only to about 15%. Faban can do better, though since the test is dominated by network traffic, the CPU utilization is still only about 70%. And for glassfish's blocking connector, I had to increase the request-processing thread count to 100 (even so, there's probably something wrong with that result, but since the blocking connector is not really what we recommend you use in production, I'm not going to delve into it).
When scalability matters, NIO is faster than traditional blocking I/O. Which, of course, is why we use grizzly as the connector architecture for glassfish (and
why you probably should NOT run out and change your configuration to use the coyote connector, unless your appserver usage pattern is very much dominated by single request/response patterns). The complex is different than the simple.
As always, your mileage will vary -- but the point is, are there tests where traditional I/O is faster than NIO? Of course -- with NIO, you always have the overhead of a select() system call, so when you measure the individual path, traditional I/O will always be faster. But when you need to scale, NIO will generally be faster: the overhead of the select() call is outweighed by having fewer thread context switches, by longer keep-alive times, or by the other options that architecture opens up. Just as we saw with hardware, you can't necessarily extrapolate performance from the single, simple case to the complex system: you must test it to see how it behaves.
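To make that tradeoff concrete, here is a minimal sketch (the class and names are mine, not grizzly's) of the NIO shape: one selector thread multiplexing every connection, paying a select() system call per wakeup, instead of parking one blocking thread per socket. The client at the bottom is the traditional style for contrast -- one thread, one socket, blocking reads.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

// One selector thread services all connections: accepts, reads, and echoes.
public class SelectorEcho {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();

        Thread loop = new Thread(() -> {
            try {
                while (selector.isOpen()) {
                    selector.select();                 // the per-wakeup system call
                    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                    while (it.hasNext()) {
                        SelectionKey key = it.next();
                        it.remove();
                        if (key.isAcceptable()) {
                            SocketChannel ch = server.accept();
                            ch.configureBlocking(false);
                            ch.register(selector, SelectionKey.OP_READ);
                        } else if (key.isReadable()) {
                            SocketChannel ch = (SocketChannel) key.channel();
                            ByteBuffer buf = ByteBuffer.allocate(256);
                            if (ch.read(buf) < 0) { ch.close(); continue; }
                            buf.flip();
                            ch.write(buf);             // echo back to the client
                        }
                    }
                }
            } catch (IOException | ClosedSelectorException ignored) { }
        });
        loop.setDaemon(true);
        loop.start();

        // Traditional blocking client: one thread tied up per connection.
        try (Socket s = new Socket("127.0.0.1", port)) {
            s.getOutputStream().write("ping".getBytes(StandardCharsets.UTF_8));
            byte[] reply = new byte[4];
            int off = 0;
            while (off < 4) {                          // read the full echo
                int n = s.getInputStream().read(reply, off, 4 - off);
                if (n < 0) break;
                off += n;
            }
            System.out.println(new String(reply, 0, off, StandardCharsets.UTF_8));
        }
        selector.close();
    }
}
```

With one connection, the blocking style wins: no select() in the path. The selector only pays off when that single loop thread stands in for hundreds of parked blocking threads -- which is exactly the simple-vs.-complex distinction above.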