|
|
||||||||||||||||||||||
Scott Oaks's BlogApril 2008 ArchivesMore on the simple vs. the complexPosted by sdo on April 03, 2008 at 09:04 AM | Permalink | Comments (0)Yesterday, I wrote that I'm often asked which X is faster (for a variety of X). The answer to that question always depends on your perspective. I answered that question in terms of hardware and concluded (as I always do) that the answer depends very much on your needs, but that a machine which appears slower in a single-threaded test will likely be faster in a multi-threaded world. You can't necessarily extrapolate results from a simple test to a complex system. What about software? Today, I'll look at that question in terms of NIO. You're probably aware that the connection handler of glassfish is based on Grizzly, an NIO framework. Yet in recent weeks, we've read claims from the Mailinator author that traditional I/O is faster than NIO. And a recent blog from Jonathan Campbell shows traditional I/O-based appserver outperforming glassfish. So what gives? Let's look more closely at the test Jonathan Campbell ran: even though it simulates multiple clients, the driver runs only a single request at a single time. It doesn't appear so on the surface, this is exactly an NIO issue; it has to do with how you architect servers to handle single-request streams vs. a conversational stream. A little know fact about glassfish is that is still contains a blocking, traditional I/O-based connector which is based on the Coyote connector from Tomcat. It is enabled that in glassfish (adding the -Dcom.sun.enterprise.web.connector.useCoyoteConnector=true option to your jvm-options) -- but read this whole blog before you decide that using that connector is a good thing. So I enabled this connector, got out my two-CPU Linux machine running Red Hat AS 3.0, and re-ran the benchmark Jonathan ran on glassfish and jBoss (I tried Geronimo, but when it didn't work for me, I abandonned it -- I'm sure I'd just done something stupidly wrong in running it, but I didn't have the time to look into it). I ran each appserver with the same JVM options, but did no other tuning. And now that we're comparing the blocking, traditional I/O connectors, Glassfish comes out well on top (and, by comparison with Jonathan's numbers, it would easily have beat Geronimo as well):
So does this mean that traditional I/O is faster than NIO? For this test, yes, But in general? Not necessarily. So next, I wrote up a little Faban Driver that uses the same war file as the original test, but Faban will run the clients simultaneously instead of sequentially and continually pound on the same sessions. In my Faban test, I ran 100 clients, each of which had a 50 ms think time between repeated calls to the session validation servlet of the test. This gave me these calls per second:
When scalability matters, NIO is faster than traditional blocking I/O. Which, of course, is why we use grizzly as the connector architecture for glassfish (and why you probably should NOT run out and change your configuration to use the coyote connector, unless your appserver usage pattern is very much dominated by single request/response patterns). The complex is different than the simple. As always, your milage will vary -- but the point is, are there tests where traditional I/O is faster than NIO? Of course -- with NIO, you always have the overhead of a select() system call, so when you measure the individual path, traditional I/O will always be faster. But when you need to scale, then NIO will generally be faster; the overehead of the select() call is outweighed by having fewer thread context switches, or by having long keep-alive times, or other options that architecture opens. Just as we saw with hardware, you can't necessarily extrapolate performance from the single, simple case to the complex system: you must test it to see how it behaves. What does it mean to be faster?Posted by sdo on April 01, 2008 at 10:57 AM | Permalink | Comments (0)As a performance engineer, I'm often asked which X is faster (for a variety of X). The answer to that question always depends on your perspective. Today, I'll talk about the answer in terms of hardware and application servers. People quite often measure the performace of their appserver on, say, their laptop and a 6-core, 24-thread Sun Fire T1000 and are surprised that the cheaper laptop can serve single requests much faster than the more expensive server. There are technical reasons for this that I won't delve into -- there are architecture guides that go into all that. Rather I want to explore the question of which of these machines is actually faster, particularly in a Java EE context. In an appserver, you typically want to process multiple requests at the same time. So looking at the speed of a single request isn't really interesting: what is the speed of multiple requests? To answer this, I took a simple program that does a long-running nonsense calculation. Running this on my laptop and 24-thread T1000, I see the following times (in seconds) to calculate X items:
In the context of an appserver, think of the calculation as the time required for the business methods of your app. I've walked through this explanation a number of times, and often I'm told that the business method is the critical part of the app, and it must be done in .6 seconds for each user -- and hence the throughput of the T1000 isn't important. And that's fine: if you need to calculate a single method in .6 seconds, then you must use the single-threaded machine. But if you need to calculate two of those at the same time, then you'll need to get two of those machines, and if you need to calculate 24 of them, you'll need to get 24 machines. So this brings us back to our question: which machine is faster? And it depends on what you need. If you need to only do one calculation at a time, then the laptop is faster. If you need to do 3 or more calculations at the same time, then the T1000 is faster. Which is faster for you will depend on your application, your traffic model, and many other variables. As always, the best thing is to try your application, but if that's not feasible, be very careful about extrapolating whatever data you do have: you cannot simply extrapolate performance data from a simple (single-threaded) model to a complex system. | ||||||||||||||||||||||
|
|