Scott Oaks's Blog

Scott Oaks works in the Java Performance group at Sun Microsystems, where he focuses on the performance of Java Enterprise Edition. He has worked with Java technology since 1996 and is the co-author of four books in the O'Reilly Java Series, including Java Threads (now in its third edition).

More on the simple vs. the complex

Posted by sdo on April 03, 2008 at 09:04 AM

Yesterday, I wrote that I'm often asked which X is faster (for a variety of X). The answer to that question always depends on your perspective. I answered that question in terms of hardware and concluded (as I always do) that the answer depends very much on your needs, but that a machine which appears slower in a single-threaded test will likely be faster in a multi-threaded world. You can't necessarily extrapolate results from a simple test to a complex system.

What about software? Today, I'll look at that question in terms of NIO. You're probably aware that the connection handler of glassfish is based on Grizzly, an NIO framework. Yet in recent weeks, we've read claims from the Mailinator author that traditional I/O is faster than NIO. And a recent blog from Jonathan Campbell shows a traditional I/O-based appserver outperforming glassfish. So what gives?

Let's look more closely at the test Jonathan Campbell ran: even though it simulates multiple clients, the driver runs only a single request at a time. Though it doesn't appear so on the surface, this is exactly an NIO issue; it has to do with how you architect servers to handle single-request streams vs. a conversational stream.

A little-known fact about glassfish is that it still contains a blocking, traditional I/O-based connector which is based on the Coyote connector from Tomcat. It can be enabled in glassfish (by adding the -Dcom.sun.enterprise.web.connector.useCoyoteConnector=true option to your jvm-options) -- but read this whole blog before you decide that using that connector is a good thing.
So I enabled this connector, got out my two-CPU Linux machine running Red Hat AS 3.0, and re-ran the benchmark Jonathan ran on glassfish and JBoss (I tried Geronimo, but when it didn't work for me, I abandoned it -- I'm sure I'd just done something stupidly wrong in running it, but I didn't have the time to look into it). I ran each appserver with the same JVM options, but did no other tuning. And now that we're comparing the blocking, traditional I/O connectors, glassfish comes out well on top (and, by comparison with Jonathan's numbers, it would easily have beaten Geronimo as well):
So does this mean that traditional I/O is faster than NIO? For this test, yes. But in general? Not necessarily. So next, I wrote up a little Faban driver that uses the same war file as the original test, but Faban runs the clients simultaneously instead of sequentially and continually pounds on the same sessions. In my Faban test, I ran 100 clients, each of which had a 50 ms think time between repeated calls to the session validation servlet of the test. This gave me these calls per second:
When scalability matters, NIO is faster than traditional blocking I/O. Which, of course, is why we use Grizzly as the connector architecture for glassfish (and why you probably should NOT run out and change your configuration to use the Coyote connector, unless your appserver usage pattern is very much dominated by single request/response patterns).

The complex is different than the simple. As always, your mileage will vary -- but the point is, are there tests where traditional I/O is faster than NIO? Of course -- with NIO, you always have the overhead of a select() system call, so when you measure the individual path, traditional I/O will always be faster. But when you need to scale, NIO will generally be faster; the overhead of the select() call is outweighed by having fewer thread context switches, or by having long keep-alive times, or by other options that the architecture opens up. Just as we saw with hardware, you can't necessarily extrapolate performance from the single, simple case to the complex system: you must test it to see how it behaves.

What does it mean to be faster?

Posted by sdo on April 01, 2008 at 10:57 AM

As a performance engineer, I'm often asked which X is faster (for a variety of X). The answer to that question always depends on your perspective. Today, I'll talk about the answer in terms of hardware and application servers.

People quite often measure the performance of their appserver on, say, their laptop and a 6-core, 24-thread Sun Fire T1000 and are surprised that the cheaper laptop can serve single requests much faster than the more expensive server. There are technical reasons for this that I won't delve into -- there are architecture guides that go into all that. Rather, I want to explore the question of which of these machines is actually faster, particularly in a Java EE context.

In an appserver, you typically want to process multiple requests at the same time.
So looking at the speed of a single request isn't really interesting: what is the speed of multiple requests? To answer this, I took a simple program that does a long-running nonsense calculation. Running this on my laptop and 24-thread T1000, I see the following times (in seconds) to calculate X items:
In the context of an appserver, think of the calculation as the time required for the business methods of your app. I've walked through this explanation a number of times, and often I'm told that the business method is the critical part of the app, and it must be done in 0.6 seconds for each user -- and hence the throughput of the T1000 isn't important. And that's fine: if you need to calculate a single method in 0.6 seconds, then you must use the single-threaded machine. But if you need to calculate two of those at the same time, then you'll need to get two of those machines, and if you need to calculate 24 of them, you'll need to get 24 machines.

So this brings us back to our question: which machine is faster? It depends on what you need. If you need to only do one calculation at a time, then the laptop is faster. If you need to do 3 or more calculations at the same time, then the T1000 is faster.

Which is faster for you will depend on your application, your traffic model, and many other variables. As always, the best thing is to try your application, but if that's not feasible, be very careful about extrapolating whatever data you do have: you cannot simply extrapolate performance data from a simple (single-threaded) model to a complex system.

Oh, go ahead -- prematurely optimize

Posted by sdo on February 25, 2008 at 03:04 PM

Recently, I've been reading an article entitled The Fallacy of Premature Optimization by Randall Hyde. I urge everyone to go read the full article, but I can't help summarizing some of it here -- it meshes so well with some of my conversations with developers over the past few years.

Most people can quote the line "Premature optimization is the root of all evil" (which was popularized by Donald Knuth, but originally comes from Tony Hoare). Unfortunately, I (and apparently Mr.
Hyde) come across too many developers who have taken this to mean that they don't have to care about the performance of their code at all, or at least not until the code is completed. This is just wrong.

To begin, the complete quote is actually

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

I agree with the basic premise of what this says, and also with everything it does not say. In particular, this quote is abused in three ways.

First, it is only talking about small efficiencies. If you're designing a multi-tier app that uses the network a lot, you want to pay attention to the number of network calls you make and the data involved in them. Network calls are a large inefficiency. And not to pick on network calls -- experienced developers know what things are inefficient, and know to program them carefully from the start.

Second, Hoare is saying (and Hyde and I agree) that you can safely ignore the small inefficiencies 97% of the time. That means that you should pay attention to small inefficiencies in the other 3%: 1 out of every 33 lines of code you write.

Third, and only somewhat relatedly, this quote feeds into the perception that 80% of the time an application spends will be in 20% of the code, so we don't have to worry about our code's performance until we find out we're in the 80%.

I'll present one example from glassfish to highlight those last two points. One day, we discovered that a particular test case for glassfish was bottlenecked on calls to Vector.size -- in particular, because of loops like this:
Vector v;
for (int i = 0; i < v.size(); i++)
    process(v.get(i));
This is a suboptimal way to process a vector, and one of the 3% of cases you need to pay
attention to. The key reason is the synchronization around Vector, which
turns out to be quite expensive when this loop is the hot loop in your program. I know,
you've been told that uncontended access to a synchronized block is almost free, but that's
also not quite true -- crossing a synchronization boundary means that the JVM must flush all
instance variables presently held in registers to main memory. The synchronization boundary
also prevents the JVM from performing certain optimizations, because it limits how the JVM
can re-order the code. So we got a big performance boost by re-writing this as
ArrayList v;
for (int i = 0, j = v.size(); i < j; i++)
    process(v.get(i));
Perhaps you're thinking that we needed to use a vector because of threading issues, but
look at that first loop again: it is not threadsafe. If this code is accessed by multiple
threads, then it's buggy in both cases.
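
To make the thread-safety point concrete, here is a minimal sketch (not the glassfish code) of the three variants. The class name VectorLoopSketch and the methods sumUnsafe, sumLocked, and sumConfined are hypothetical names chosen for illustration, and summing integers stands in for the process() call above. The sketch assumes that, if the data really is shared, holding the Vector's own lock for the whole traversal is an acceptable fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

public class VectorLoopSketch {

    // Original pattern: size() and get() each acquire and release the
    // Vector's lock separately, so another thread can remove elements
    // between the two calls.
    public static int sumUnsafe(Vector<Integer> v) {
        int sum = 0;
        for (int i = 0; i < v.size(); i++) {
            // May throw ArrayIndexOutOfBoundsException if v shrinks concurrently.
            sum += v.get(i);
        }
        return sum;
    }

    // If the data is shared, hold the Vector's own monitor across the whole
    // loop; Vector's synchronized methods lock on the same object, so other
    // threads cannot modify it mid-traversal.
    public static int sumLocked(Vector<Integer> v) {
        synchronized (v) {
            int sum = 0;
            for (int i = 0, n = v.size(); i < n; i++) {
                sum += v.get(i);
            }
            return sum;
        }
    }

    // Rewritten pattern from the post: no locking at all, which is only
    // valid when the list is confined to a single thread.
    public static int sumConfined(ArrayList<Integer> list) {
        int sum = 0;
        for (int i = 0, n = list.size(); i < n; i++) {
            sum += list.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4);
        System.out.println(sumUnsafe(new Vector<>(data)));
        System.out.println(sumLocked(new Vector<>(data)));
        System.out.println(sumConfined(new ArrayList<>(data)));
    }
}
```

Run single-threaded, all three methods return the same sum; the difference only shows up when another thread mutates the collection during the loop. Locking the whole traversal keeps the synchronization cost, so it is worth it only when the data is genuinely shared.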
What about that 80/20 rule? It's true that we found this case because it was consuming a lot (not 80%, but still a lot) of time in our program. [Which also means that fixing this case is tardy optimization, but there it is.] But the problem is that there wasn't just one loop written like this in the code; there were (and still are...sigh) hundreds. We fixed the few that were the worst offenders, but there are still many, many places in the code where this construct lives on. It's considered "too hard" to go change all the places where this occurs (NetBeans could refactor it all pretty quickly, but there's a risk that subtle differences in the loops would mean that some would need to be refactored differently).

When we addressed performance in Glassfish V2 in order to get our excellent SPECjAppServer results, we fixed a lot of little things like this, because we spend 80% of our time in about 50% of our code. It's what I call performance death by a thousand cuts: it's great when you can find a simple CPU-intensive set of code to optimize. But it's even better if developers pay some attention to writing good, performant code at the outset and you don't have to track down hundreds of small things to fix.

Hyde's full article has some excellent references for further reading, as well as other important points about why, in fact, paying attention to performance as you're developing is a necessary part of coding.