The Source for Java Technology Collaboration



Scott Oaks's Blog

Performance Archives


What does it mean to be faster?

Posted by sdo on April 01, 2008 at 10:57 AM | Permalink | Comments (0)

As a performance engineer, I'm often asked which X is faster (for a variety of X). The answer to that question always depends on your perspective.

Today, I'll talk about the answer in terms of hardware and application servers. People quite often measure the performance of their appserver on, say, their laptop and a 6-core, 24-thread Sun Fire T1000 and are surprised that the cheaper laptop can serve single requests much faster than the more expensive server.

There are technical reasons for this that I won't delve into -- there are architecture guides that go into all that. Rather I want to explore the question of which of these machines is actually faster, particularly in a Java EE context. In an appserver, you typically want to process multiple requests at the same time. So looking at the speed of a single request isn't really interesting: what is the speed of multiple requests?

To answer this, I took a simple program that does a long-running nonsense calculation. Running this on my laptop and 24-thread T1000, I see the following times (in seconds) to calculate X items:
# Items    Laptop    T1000
      1      0.66      1.3
      2      1.4       1.5
      4      2.8       1.6
      8      5.4       2.5
     16     10.8       3.7
     24     16.6       4.8
As you'd expect, the performance of the laptop degrades linearly, to where it takes 16.6 seconds to perform 24 calculations. The performance of the T1000 doesn't scale linearly, but even though it takes twice as long as the laptop to perform a single calculation, it can perform 24 calculations in less than one-third of the time the laptop needs.
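A measurement like the one in the table can be sketched as follows. This is a minimal harness under stated assumptions, not the program that produced the numbers above: the class name, the method names, and the busy-work loop are all hypothetical stand-ins for "a long-running nonsense calculation."

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical harness: names and workload are illustrative only.
class ConcurrentTiming {

    // A nonsense calculation that keeps one thread busy for a while.
    static double nonsense(int iterations) {
        double acc = 0.0;
        for (int i = 1; i <= iterations; i++) {
            acc += Math.sqrt(i) * Math.sin(i);
        }
        return acc;
    }

    // Runs `items` calculations at the same time, one thread per item, and
    // returns the elapsed wall-clock time in seconds.
    static double timeItems(int items, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(items);
        try {
            List<Callable<Double>> tasks = new ArrayList<>();
            for (int i = 0; i < items; i++) {
                tasks.add(() -> nonsense(iterations));
            }
            long start = System.nanoTime();
            // invokeAll blocks until every task has completed.
            for (Future<Double> f : pool.invokeAll(tasks)) {
                f.get();
            }
            return (System.nanoTime() - start) / 1e9;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        for (int items : new int[] {1, 2, 4, 8, 16, 24}) {
            System.out.printf("%2d items: %.2f s%n", items, timeItems(items, 50_000_000));
        }
    }
}
```

On a machine with few hardware threads the times should grow roughly in proportion to the item count; on a machine with many hardware threads they should grow much more slowly. The absolute numbers depend entirely on the hardware and the JVM.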

In the context of an appserver, think of the calculation as the time required for the business methods of your app. I've walked through this explanation a number of times, and often I'm told that the business method is the critical part of the app, and it must be done in .6 seconds for each user -- and hence the throughput of the T1000 isn't important. And that's fine: if you need to calculate a single method in .6 seconds, then you must use the single-threaded machine. But if you need to calculate two of those at the same time, then you'll need to get two of those machines, and if you need to calculate 24 of them, you'll need to get 24 machines.

So this brings us back to our question: which machine is faster? And it depends on what you need. If you need to only do one calculation at a time, then the laptop is faster. If you need to do 3 or more calculations at the same time, then the T1000 is faster. Which is faster for you will depend on your application, your traffic model, and many other variables. As always, the best thing is to try your application, but if that's not feasible, be very careful about extrapolating whatever data you do have: you cannot simply extrapolate performance data from a simple (single-threaded) model to a complex system.

Oh, go ahead -- prematurely optimize

Posted by sdo on February 25, 2008 at 03:04 PM | Permalink | Comments (13)

Recently, I've been reading an article entitled The Fallacy of Premature Optimization by Randall Hyde. I urge everyone to go read the full article, but I can't help summarizing some of it here -- it meshes so well with some of my conversations with developers over the past few years.

Most people can quote the line "Premature optimization is the root of all evil" (which was popularized by Donald Knuth, but originally comes from Tony Hoare). Unfortunately, I (and apparently Mr. Hyde) come across too many developers who have taken this to mean that they don't have to care about the performance of their code at all, or at least not until the code is completed. This is just wrong.

To begin, the complete quote is actually
We should forget about small efficiencies, say about 97% of the time: premature optimization
is the root of all evil.
I agree with the basic premise of what this says, and also with everything it does not say. In particular, this quote is abused in three ways.

First, it is only talking about small efficiencies. If you're designing a multi-tier app that uses the network a lot, you want to pay attention to the number of network calls you make and the data involved in them. Network calls are a large inefficiency. And not to pick on network calls -- experienced developers know what things are inefficient, and know to program them carefully from the start.

Second, Hoare is saying (and Hyde and I agree) that you can safely ignore the small inefficiencies 97% of the time. That means that you should pay attention to small inefficiencies 1 out of every 33 lines of code you write.

Third, and only somewhat relatedly, this quote builds into the perception that 80% of the time an application spends will be in 20% of the code, so we don't have to worry about our code's performance until we find out we're in the 80%.

I'll present one example from glassfish to highlight those last two points. One day, we discovered that a particular test case for glassfish was bottlenecked on calls to Vector.size -- in particular, because of loops like this:
Vector v;
for (int i = 0; i < v.size(); i++)
     process(v.get(i));
This is a suboptimal way to process a vector, and one of the 3% of cases you need to pay attention to. The key reason is the synchronization around Vector, which turns out to be quite expensive when this loop is the hot loop in your program. I know, you've been told that uncontended access to a synchronized block is almost free, but that's also not quite true -- crossing a synchronization boundary means that the JVM must flush all instance variables presently held in registers to main memory. The synchronization boundary also prevents the JVM from performing certain optimizations, because it limits how the JVM can re-order the code. So we got a big performance boost by re-writing this as
ArrayList v;
for (int i = 0, j = v.size(); i < j; i++)
     process(v.get(i));
Perhaps you're thinking that we needed to use a vector because of threading issues, but look at that first loop again: it is not threadsafe. If this code is accessed by multiple threads, then it's buggy in both cases.
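To make the thread-safety point concrete: in the Vector loop, each call to size() and get(i) takes and releases the lock separately, so another thread can remove an element between the two calls. If the list really is shared, one option is to hold a single lock for the whole traversal. The sketch below assumes a hypothetical ItemProcessor class with a private lock object; none of these names come from the glassfish code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: ItemProcessor, process(), and the lock object are
// illustrative names, not code from glassfish.
class ItemProcessor {
    private final Object lock = new Object();
    private final List<String> items = new ArrayList<>();
    private int processed;

    void add(String item) {
        synchronized (lock) {
            items.add(item);
        }
    }

    // One lock acquisition covers the whole traversal, so the size read
    // before the loop stays valid for every get(i) inside it.
    int processAll() {
        synchronized (lock) {
            for (int i = 0, n = items.size(); i < n; i++) {
                process(items.get(i));
            }
            return processed;
        }
    }

    // Stand-in for real work; only ever called while the lock is held.
    private void process(String item) {
        processed++;
    }
}
```

This pays for synchronization once per traversal instead of twice per element. If process() is slow, copying the list under the lock and iterating over the copy outside it may be the better trade-off.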

What about that 80/20 rule? It's true that we found this case because it was consuming a lot (not 80%, but still a lot) of time in our program. [Which also means that fixing this case is tardy optimization, but there it is.] But the problem is that there wasn't just one loop written like this in the code; there were (and still are...sigh) hundreds. We fixed the few that were the worst offenders, but there are still many, many places in the code where this construct lives on. It's considered "too hard" to go change all the places where this occurs (though NetBeans could refactor it all pretty quickly, but there's a risk that subtle differences in the loop would mean that it would need to be refactored differently).

When we addressed performance in Glassfish V2 in order to get our excellent SPECjAppServer results, we fixed a lot of little things like this, because we spend 80% of our time in about 50% of our code. It's what I call performance death by a thousand cuts: it's great when you can find a simple CPU-intensive set of code to optimize. But it's even better if developers pay some attention to writing good, performant code at the outset and you don't have to track down hundreds of small things to fix.

Hyde's full article has some excellent references for further reading, as well as other important points about why, in fact, paying attention to performance as you're developing is a necessary part of coding.

Performance Stat of the Day

Posted by sdo on February 03, 2008 at 07:12 PM | Permalink | Comments (0)

I've written several times before about how you have to measure performance to understand how you're doing -- and so here's my favorite performance stat of the day: New York 17, New England 14.

Don't (necessarily) trust your tools

Posted by sdo on January 22, 2008 at 09:08 AM | Permalink | Comments (6)

I spent last week working with a customer in Phoenix (only a few weeks before the Giants go there to beat the Patriots), and one of the things we wanted to test was how their application would work with the new in-memory replication feature of the appserver. They brought along one of their apps, we installed it and used their jmeter test, and quickly verified that the in-memory session replication worked as expected in the face of a server failure.

Feeling confident about the functionality test, we did some performance testing using their jmeter script. We got quite good throughput from their test. But as we watched it run, we noticed jmeter reporting that the throughput kept continually decreasing. Since we were pulling the plug on instances in our 6-node cluster all the time, at first I just chalked it up to that. But then we ran a test without failing instances, and the same thing happened: continually decreasing performance.

Nothing is quite as embarrassing as showing off your product to a customer and having the product behave badly. I was ready to blame a host of things: botched installation, network interference, phases of the moon. Secretly, I was willing to blame the customer app: if there's a bug, it must be in their code, not ours.

Eventually, we simplified the test down to a single instance, no failover, and a single URL to a simple JSP: pretty basic stuff, and yet it still showed degradation over time (in fact, things got worse). Now there were two things left to blame: jmeter, or the phases of the moon. Neither seemed likely, until I took a closer look at what jmeter was doing: it turns out that the jmeter script was using an Aggregate Report. That report, in addition to updating the throughput for each request, also updates various statistics, including the 90% response time. It does this in real-time, which may seem like a good idea: but the problem is that calculating the 90% response time is an O(n) operation: the more requests jmeter made, the longer it took to calculate the 90% time.

I've previously written in other contexts about why tests with 0 think time are subject to misleading results. And it turns out this is another case of that: because there is no think time in the jmeter script, the time to calculate the 90% penalizes the total throughput. As the time to calculate the 90% increases, the time available for jmeter to make requests decreases, and hence the reported throughput decreases over time.
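One way to avoid this kind of measurement overhead is to do only constant-time work per request and defer the percentile calculation until the run is over. The recorder below is a sketch of that idea, not jmeter's implementation; the class and method names are hypothetical, and the percentile uses the simple nearest-rank method.

```java
import java.util.Arrays;

// Hypothetical recorder, not jmeter's code: it defers the percentile
// calculation until the test has finished.
class ResponseTimes {
    private long[] samples = new long[1024];
    private int count;

    // Called once per request: amortized O(1), so the cost of recording
    // does not grow with the number of requests already made.
    void record(long millis) {
        if (count == samples.length) {
            samples = Arrays.copyOf(samples, count * 2);
        }
        samples[count++] = millis;
    }

    // Called once after the test: sorts a copy of the samples and reads off
    // the value at the requested percentile (nearest-rank method).
    long percentile(double pct) {
        if (count == 0) {
            throw new IllegalStateException("no samples recorded");
        }
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * count);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With this arrangement the load generator spends the same small amount of time on bookkeeping for the millionth request as for the first, so the bookkeeping cannot show up as a slow decline in reported throughput.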

I'm not actually sure if jmeter is smart enough to do this calculation correctly even if there is think time between requests: will it just blindly sleep for the think time, or will it correctly calculate the think time minus its own processing time? For my test, it doesn't matter: the simpler thing is to use a different reporting tool that doesn't have the 90% calculation (which, I'm happy to report, showed glassfish/SJSAS 9.1 performing quite well with in-memory replication across the cluster and no degradation over time).

But what's more important to me is that it reinforces a lesson that I seem to have to relearn a lot: sometimes, your intuition is smarter than your tools. I had a strong intuition from the beginning that the test was flawed, but despite that, we spent a fair amount of time tracking down possible bugs in glassfish or the servlets.

And I also don't mean to limit this to a discussion of this particular bug/design issue with jmeter. When we tested startup for the appserver, a particular engineer was convinced that glassfish was idle for most of its startup time: the UNIX time command reported that the elapsed time to run asadmin start-domain was 30 seconds, but the CPU time used was only 1 or 2 seconds. The conclusion from that was that glassfish sat idle for 28 seconds. But intuitively, we knew that wasn't true (for one thing, the disk was cranking away all that time, and a quick glance at a CPU meter would disprove the theory that the CPU wasn't being used). And of course, it turns out that asadmin was starting processes which started processes, and shell timing code didn't understand all the descendant structure (particularly when intermediate processes exited but the grandchild process -- the appserver -- was still executing). The time command was just not suited to giving the desired answer.

Tools that give you visibility into your applications are invaluable; I'm not suggesting that you should blindly cling to your hypothesis whenever a tool gives you a result you don't expect. But when a tool and your intuition are in conflict, don't be afraid to examine the possibility that the tool isn't measuring what you wanted it to.

A Glassfish Tuning Primer

Posted by sdo on December 03, 2007 at 11:25 AM | Permalink | Comments (1)

When I reported our recent excellent SPECjAppServer 2004 scores, one glassfish user responded:
I sure wish you guys were able to come up with a thorough write up
about the SPEC Benchmark architecture, and the techniques you guys
used to get the numbers you get and, more importantly, how those
techniques might apply to every day applications we run in the wild.
While we do have a full performance-tuning chapter in the glassfish/SJSAS docset, I can understand the appeal of a quick cheat-sheet for getting the most out of glassfish in production. Most of this information has appeared in various blogs, particularly by Jeanfrancois, who is so expertly focused on making sure that grizzly and our http path are as fast as possible. Still, I hope that gathering this quick list together will be a good single-source summary.

One thing to note about these guidelines: a lot of glassfish configurations (particularly when you start with a developer profile) are optimized for developers. In development, performance is different: you'll trade off a few seconds here and there to make starting the appserver faster, or deploying something faster. In production, you'll make opposite trade-offs. So if you wonder why some of the things in this list aren't necessarily the default setting, that's probably why.

Tune your JVM

The first step is to tune the JVM, which is of course different for every deployment. These are the options set via the jvm-option tag in your domain.xml (or the JVM options page in the admin console). As a general rule, I like to use the throughput collector with large heaps and a moderate-sized young generation: that makes young GCs quite fast. That will lead to a periodic full GC, but the impact of that on total throughput is usually quite minimal. If you absolutely cannot tolerate a pause of a few seconds, you can look at the concurrent collector, but be aware that this will impact your total throughput. So a good set of JVM arguments to start with is:
-server -Xmx3500m -Xms3500m -Xmn1500m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+AggressiveOpts
On a CMT machine like the SunFire T5220 server, you'll want to use large pages of 256m, and a heap that is a multiple of that:
-server -XX:LargePageSizeInBytes=256m -Xmx2560m -Xms2560m -Xmn1024m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=16 -XX:+AggressiveOpts
More details of the impact of a CMT machine are available at Sun's Cool Threads website.

Make sure to remove the -client option from your jvm options, to include the -Dcom.sun.enterprise.server.ss.ASQuickStartup=false flag, and -- if you are using CMP 2.1 entity beans -- to include -DAllowMediatedWriteInDefaultFetchGroup=true.
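After editing the options, it can be worth confirming that the running JVM actually picked them up. The snippet below is a small, hypothetical check (the class name is made up) that uses only standard JDK calls; it would have to run inside the server's JVM, for example from a test servlet, to report the appserver's settings.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper for checking that JVM options took effect.
class JvmSettings {

    // Maximum heap the JVM will attempt to use, in megabytes. The value is
    // usually a little below -Xmx because of collector bookkeeping.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    // The options passed to the JVM at startup, e.g. -server or -Xmn1500m.
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("JVM arguments: " + jvmArguments());
    }
}
```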

Tune the default-web.xml

Settings in the default-web.xml file are overridden by an application's web.xml, but I find it easier to set production-ready values in the default-web.xml file so that all applications will get them. In particular, under the JspServlet definition, add these two parameters:
<init-param>
  <param-name>development</param-name>
  <param-value>false</param-value>
</init-param>
<init-param>
  <param-name>genStrAsCharArray</param-name>
  <param-value>true</param-value>
</init-param>
That will mean you cannot change JSP pages on your production server without redeploying the application, but that's generally what you want anyway.

One note about this: this file is only consulted when an application is deployed. So make sure you change the file and then deploy your application, or you won't see any benefit from this change.

Tune the HTTP threads

As you know, there are two parameters here: the HTTP acceptor threads, and the request-processing threads. These values have unfortunately had different meanings in a few of our releases, and some confusion about them remains. The acceptor threads are used both to accept new connections to the server and to schedule existing connections when a new request comes over them. In general, you'll need 1 of these for every 1-4 cores on your machine; no more than that (unlike, say, SJSAS 8.1, where this had a completely different meaning). The request threads run HTTP requests. You want "just enough" of those: enough to keep the machine busy, but not so many that they compete for CPU resources -- if they compete for CPU resources, then your throughput will suffer greatly. Too many request-processing threads is a big performance problem I see on many machines.

How many is "just enough"? It depends, of course -- in a case where HTTP requests don't use any external resource and are hence CPU bound, you want only as many HTTP request processing threads as you have CPUs on the machine. But if the HTTP request makes a database call (even indirectly, like by using a JPA entity), the request will block while waiting for the database, and you could profitably run another thread. So this takes some trial and error, but start with the same number of threads as you have CPUs and increase them until you no longer see an improvement in throughput.
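The trial-and-error procedure can be written down as a simple search. The sketch below is an illustration only: the class, the method, and the 2% improvement threshold are assumptions, and the measureThroughput function stands in for whatever load test you actually run against the server at each thread count.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical helper: names and the improvement threshold are illustrative.
class ThreadCountSearch {

    // Starts at `cpus` threads and keeps adding one thread for as long as the
    // measured throughput improves by more than `minGain` (0.02 means 2%).
    static int findThreadCount(int cpus, int maxThreads, double minGain,
                               IntToDoubleFunction measureThroughput) {
        int best = cpus;
        double bestThroughput = measureThroughput.applyAsDouble(best);
        for (int n = cpus + 1; n <= maxThreads; n++) {
            double throughput = measureThroughput.applyAsDouble(n);
            if (throughput <= bestThroughput * (1.0 + minGain)) {
                break; // no meaningful improvement: stop adding threads
            }
            best = n;
            bestThroughput = throughput;
        }
        return best;
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // Synthetic model for demonstration only: throughput rises with the
        // thread count and then flattens out at twice the CPU count.
        IntToDoubleFunction model = n -> Math.min(n, 2 * cpus) * 100.0;
        System.out.println("threads to configure: "
                + findThreadCount(cpus, 8 * cpus, 0.02, model));
    }
}
```

In practice each measurement is a full load-test run, so a coarser step than one thread at a time may be more practical.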

Tune your JDBC drivers

Speaking of databases, it's quite important in glassfish to use JDBC drivers that perform statement caching; this allows the appserver to reuse prepared statements and is a huge performance win. The JDBC drivers that come bundled with the Sun Java System Application Server provide such caching; Oracle's standard JDBC drivers do as well, as do recent drivers for Postgres and MySQL. Whichever driver you use, make sure to configure the properties to use statement caching when you set up the JDBC connection pool -- e.g., for Oracle's JDBC drivers, include the properties
ImplicitCachingEnabled=true
MaxStatements=200

Use the HTTP file cache

If you serve a lot of static content, make sure to enable the HTTP file cache.



Have I piqued your interest? As I mentioned, there are hundreds of pages of tuning guidelines in our docset. But here at least you have some important first steps.

A scalable SPECjAppServer 2004 submission

Posted by sdo on November 26, 2007 at 09:16 AM | Permalink | Comments (3)

Last week, Sun published a new SPECjAppServer 2004 benchmark score: 8439.36 JOPS@Standard [1]. [I'd have written about it sooner, but it was published late Wednesday, and I had to go home and bake a lot of pies.] This is a "big" number, and frankly, it's the one thing that's been missing in our repertoire of submissions. We'd previously shown leading performance on a single chip, but workloads in general (and SPECjAppServer 2004 in particular) don't scale linearly as you increase the load. This number shows that we can scale our appserver across multiple nodes and machines quite well.

I've been asked quite a lot about what scalability actually means for this workload, so let me talk about Java EE scalability for a little bit. The first question I'm invariably asked is, isn't this just a case of throwing lots more hardware at the problem? Clearly, at a certain level the answer is yes: you can't do more work without more hardware. And I don't want to minimize the importance of the amount of hardware that you throw at the problem. There are presently two published SPECjAppServer scores that are higher than ours: HP/Oracle have results of 9459.19 JOPS@Standard [2] and 10519.43 JOPS@Standard [3]. Yet those results require 11 and 12 (respectively) appserver tier machines; our result uses only 6 appserver tier machines. More telling is that the database machine in our submission is a pretty beefy Sun Fire E6900 with 24 CPUs and 96GB of memory. Pretty beefy, that is, until you look at the HP/Oracle submissions that rely on 40 CPUs and 327GB of memory in two Superdome chassis. So yes, if you have millions (and I mean many millions -- ask your HP rep how much those two Superdomes will cost) of dollars to throw at the hardware, you can expect to get a quite high number on the benchmark.

The database, in fact, is one reason why most Java EE benchmarks (and workloads) will not scale linearly -- you can horizontally scale appserver tiers pretty well, but there is still only a single database that must handle an increasing load.

On the appserver side, horizontal scaling is not quite just a matter of throwing more hardware at the problem. SPECjAppServer 2004 is partitioned quite nicely: no failover between J2EE instances is required, connections to a particular instance are sticky, and the instances don't need to communicate with each other. All of that leads to quite nice linear scaling.

But one part of the benchmark doesn't scale linearly, because it is dependent on the size of the database. SPECjAppServer 2004 uses a bigger database for bigger configurations. For example, our previous submission on a single SunFire T2000 achieved a score of 883.66 JOPS@Standard [4]. The benchmark sizing rules meant that the database used for that configuration was only 10% as large as the database we used in our current submission. [More reason why that database scaling is important.] And in particular, it meant that the database in the small submission held 6000 items in the O_item table while our current submission had 60000 items in that table.

For SPECjAppServer 2004, that's important because the benchmark allows the appserver to cache that particular data in read-only, container-managed EJB 2.1 entities. [That's a feature that's explicitly outside of the J2EE 1.3/1.4 specification, so your portable J2EE apps won't use it -- your portable Java EE 5 apps that use JPA can use cached database data, though somewhat differently.] Caching 6K items is something a single instance can do, but caching all 60K items will cause GC issues for the appserver. Hence, in some areas, the appserver will have to do more work as the database size increases, even if the total load per appserver instance is the same.

So a "big" score on this benchmark is a factor of two things: there are things within the appserver architecture that influence how well you will scale, even in a well-partitioned app. But the amount of hardware (and cost of that hardware) remains the key driving factor in just how high that score can go. As I've stressed many times, benchmarks like this are a proof-point: our previous numbers establish that we have quite excellent performance, and this number establishes that we can scale quite well. As always, the only relevant test remains your application: download the appserver now and see how well it responds to your requirements.

Finally, as always, some disclosures: SPEC and the benchmark name SPECjAppServer 2004 are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of 11/26/07. For the latest SPECjAppServer 2004 benchmark results, visit http://www.spec.org/. Referenced scores:
[1] Six Sun SPARC Enterprise T5120 (6 chip, 48 cores) appservers and one Sun Fire E6900 (24 chips, 48 cores) database; 8,439.36 JOPS@Standard
[2] Eleven HP BL860c (22 chips, 44 cores) appservers and two HP Superdomes (40 chips, 80 cores) database; 9,459.19 JOPS@Standard
[3] Twelve HP BL860c (24 chips, 48 cores) appservers and two HP Superdomes (40 chips, 80 cores) database; 10,519.43 JOPS@Standard
[4] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 883.66 SPECjAppServer2004 JOPS@Standard

SJSAS 9.1 (Glassfish V2) posts new SPECjAppServer 2004 result

Posted by sdo on July 10, 2007 at 07:27 AM | Permalink | Comments (2)

Today, Sun officially announced SPECjAppServer 2004 scores on our Sun Java Application Server 9.1, which (as you no doubt know) is the productized version of the open-source Glassfish V2 project. We've previously submitted results for SJSAS 9.0 (aka Glassfish V1), which at the time we were quite proud of: they were the only SPECjAppServer scores based on an open-source application server, and that gave us a quite good price/performance story. Considering where we started, I was happy to conclude that those scores were "good enough."

"Good enough" is no longer good enough. Today, we posted the highest ever score for SPECjAppServer 2004 on a single Sun Fire T2000 application server: 883.66 JOPS@Standard [1]. The Sun Fire T2000 in this case has a 1.4ghz CPU; the application also uses a Sun Fire T2000 running at 1.0ghz for its database tier. This result is 10% higher than WebLogic's score of 801.70 JOPS@Standard [2] on the same appserver machine. In addition, this result is almost 70% higher than our previous score of 521.42 JOPS@Standard on a Sun Fire T2000 [3], although that Sun Fire T2000 was running at only 1.2ghz. So that doesn't mean that we are 70% faster than we were, but we are quite substantially faster and are quite pleased to have the highest ever score on the Sun Fire T2000.

This result is personally gratifying to me in many ways, and I am proud of it (and proud of the work by the appserver engineers that it represents) on many, many levels. But it is just a benchmark, so let me touch on two things that means.

First, vendors and their marketing departments love to play leap-frog games with benchmarks. My favorite example of this: some time ago, BEA posted a score of 615.64 JOPS@Standard [4] on the 1.2ghz T2000, only to be outdone a few months later by IBM WebSphere's score of 616.22 JOPS@Standard [5] on the same system. It's good marketing press, but at some point those sorts of differences become slightly ridiculous to end users.

So yes, at some point it's conceivable that someone will post a higher score on this machine than we have; it's conceivable that I'll be back touting some improvements on our score (because my protestations about benchmarks aside, I'm not above playing the game either). But don't let any of that keep you from the point: this is a result that fundamentally changes the nature of that game. We used to be content with having a good result in terms of price/performance and watching IBM, Oracle, and BEA leap-frog among themselves in terms of raw performance. Now, we're the raw performance leader. There will be jockeying for position in the future, but we've changed forever the set of contenders. [We're also still quite interested in being price/performance leaders, by the way, which is why we also published a score this week using the free, open-source Postgres database.]

Second, remember that this is just a benchmark. Will you see similar results on your application? It depends. SPECjAppServer 2004 doesn't use EJB 3.0, JPA, WebServices, JSF, or any of a host of Java EE technologies (and frankly, I'm pretty happy with our performance in most of those areas; see, for example this article or this one on our WebServices performance). On the other hand, its performance is significantly affected by improvements we made to read-only EJBs, remote EJB invocation, and co-located JMS consumers and producers. So some of the improvements we've made may be in areas your application doesn't even use. [That's another reason I was happy with our previous scores: they established us as a viable appserver vendor, and I knew that customers who benchmarked their own applications would likely see better relative performance than that displayed by SPECjAppServer.]

Don't get me wrong: we have also made substantial performance improvements across the board: in the servlet connector and container, in JSP processing, in the local EJB container, in connection pooling, in CMP 2.1, and so on. This is really an important performance release for us. But as I always have said: the only realistic benchmark for your environment is your application. So go grab a recent build of glassfish V2, and see for yourself.

Now, as always, some disclosures: SPEC and the benchmark name SPECjAppServer 2004 are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of 07/10/07. The comparison presented is based on application servers run on the Sun Fire T2000 1.2ghz and 1.4ghz servers. For the latest SPECjAppServer 2004 benchmark results, visit http://www.spec.org/. Referenced scores:
[1] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 883.66 SPECjAppServer2004 JOPS@Standard
[2] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 801.70 SPECjAppServer2004 JOPS@Standard
[3] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 521.42 SPECjAppServer2004 JOPS@Standard
[4] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire V490 (4 chips, 8 cores, 2 cores/chip) database; 615.64 SPECjAppServer2004 JOPS@Standard
[5] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire X4200 (2 chips, 4 cores, 2 cores/chip) database; 616.22 SPECjAppServer2004 JOPS@Standard


Switching tracks

Posted by sdo on July 09, 2007 at 02:06 PM | Permalink | Comments (2)

One of those lesser-known features of Java is that it contains two different bytecodes for switch statements: a generic switch statement, and an (allegedly more optimal) table-driven switch statement. The compiler will automatically generate one or the other of these statements depending on the values in the switch statement: the table-driven statement is used when the switch values are close to being sequential (possibly with a few gaps), where the generic statement is used in all other cases. It's the sort of thing that intrigues performance-oriented developers: is the table-driven statement really more optimal? Is it worth coercing the variable involved in a switch statement so that the compiler can generate a table-driven statement? Is there ever a case in a real-world program where this would even matter? Interesting questions, but since I assumed the answer to the last one was "no", I never really thought about the first few.

Now, however, I'm looking at some profiles of Glassfish V2, and I find that when running a particular application, we're spending a full 1% of our time in this method:
protected java.util.logging.Level convertLevel(int level) {
    int index = level / 100;
    switch (index) {
        case 3: return Level.FINEST;
        case 4: return Level.FINER;
        case 5: return Level.FINE;
        case 7: return Level.CONFIG;
        case 8: return Level.INFO;
        case 9: return Level.WARNING;
        case 10: return Level.SEVERE;
        default: return Level.FINER;
    }
}
Seems like a pretty simple method to be spending so much time in (and let's face it, sampling profilers may overstate their time for a method like this). So I dug in a little further. The level value passed to this method is always exactly divisible by 100: it's not the case that level can be 300, 305, and 310. So there is a one-to-one correspondence between the integers passed to the method and the Level object returned. So I was rather impressed that the original author of this code had known enough arcane Java trivia to know that he could coerce the argument to get the table-driven switch statement.

Alas, if only he'd taken the next step to see if the performance difference was worthwhile. It turns out that it wasn't: removing the division from this method and recasting the switch statement to values of 300, 400, and so on eliminated all the time the profiler attributed to this method and resulted in a .5% improvement in the way the application ran. I also did some quick micro-benchmarking of the method and discovered that if I didn't need to coerce the argument into the switch statement (that is, if I passed in values of 3, 4, 5, etc. to begin with), the performance of the method was essentially the same, but adding the division statement to coerce the argument slowed down execution of the method quite significantly.
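A quick comparison of the two variants might look like the sketch below. This is not the micro-benchmark used for the results above: the class name, the timing loop, and the round count are assumptions, and a naive loop like this is easily distorted by JIT warm-up and dead-code elimination, so treat its numbers as a rough indication and prefer a proper benchmarking harness for anything you intend to act on.

```java
import java.util.logging.Level;

// Hypothetical micro-benchmark: names and loop structure are illustrative.
class SwitchBench {
    static final int[] LEVELS = {300, 400, 500, 700, 800, 900, 1000};

    // Variant from the post: divide first so the case values are nearly
    // sequential.
    static Level withDivision(int level) {
        switch (level / 100) {
            case 3: return Level.FINEST;
            case 4: return Level.FINER;
            case 5: return Level.FINE;
            case 7: return Level.CONFIG;
            case 8: return Level.INFO;
            case 9: return Level.WARNING;
            case 10: return Level.SEVERE;
            default: return Level.FINER;
        }
    }

    // Variant without the division: switch directly on the sparse values.
    static Level direct(int level) {
        switch (level) {
            case 300: return Level.FINEST;
            case 400: return Level.FINER;
            case 500: return Level.FINE;
            case 700: return Level.CONFIG;
            case 800: return Level.INFO;
            case 900: return Level.WARNING;
            case 1000: return Level.SEVERE;
            default: return Level.FINER;
        }
    }

    // Times `rounds` calls of one variant and returns elapsed nanoseconds.
    static long run(boolean divide, int rounds) {
        long sink = 0;
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            int level = LEVELS[r % LEVELS.length];
            Level result = divide ? withDivision(level) : direct(level);
            sink += result.intValue();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == -1) {
            System.out.println(sink); // never true; keeps the result "used"
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int rounds = 50_000_000;
        run(true, rounds);  // warm-up so both variants get JIT-compiled
        run(false, rounds);
        System.out.printf("with division: %d ms%n", run(true, rounds) / 1_000_000);
        System.out.printf("direct values: %d ms%n", run(false, rounds) / 1_000_000);
    }
}
```

Both variants return the same Level for every input that is an exact multiple of 100, which is the only kind of input the method receives according to the analysis above.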

At .5% of performance, I'm not sure that this is the real-world example of where this would ever matter -- though when you provide a platform for other people's applications, you worry about your operations being as streamlined as possible. But it is another example of why you should test your code before making assumptions about how it will perform, and particularly before writing code to work around a potential performance issue.


Dynamically sizing threadpools

Posted by sdo on June 07, 2007 at 10:15 AM | Permalink | Comments (2)

Almost every thread pool implementation takes great pains to make sure that it can dynamically resize the number of threads it utilizes: you specify the minimum number of threads you want, the maximum number, and the thread pool in its wisdom will automatically configure itself to have the optimal number of threads for your workload. At least, that's the theory...

But what about in practice? I'd argue that its utility is very limited, and that in many cases, a dynamically-resizing threadpool will actually harm the performance of your system.

First, a quick review of why we have threadpools. From a performance perspective, the most important task of a threadpool is to throttle the number of simultaneous tasks running on your system. I know that you may think that the purpose of a threadpool is to allow you to conveniently run multiple things at once. It does that, but more importantly, it prevents you from running too many things at once. If you need to run 100 CPU-bound tasks on a machine with 4 CPUs, you will get optimal throughput if you run only 4 tasks at a time: each task fully utilizes the CPU while it is running. Since you can't run more than 4 tasks at once, you won't get any better throughput by having more threads -- in fact, if you add more threads to the saturated system, your throughput will go down: the threads will compete with each other for CPU and other system resources, and the operating system will spend more time than necessary managing the competing threads.
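
To make the throttling concrete, here is a minimal sketch that pushes 100 CPU-bound tasks through a fixed-size pool and records how many were ever running at once. The class name ThrottleDemo, the helper names, and the busyWork loop are all made up for illustration; the only point is that the peak never exceeds the pool size, however many tasks are queued.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of any appserver.
class ThrottleDemo {
    // Runs taskCount CPU-bound tasks on a fixed pool and returns the largest
    // number of tasks that were ever running at the same time.
    static int peakConcurrency(int poolSize, int taskCount) {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < taskCount; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                busyWork();
                running.decrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    // Stand-in for a CPU-bound job (a nonsense calculation).
    static double busyWork() {
        double x = 0;
        for (int i = 1; i < 200_000; i++) {
            x += Math.sqrt(i);
        }
        return x;
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("peak concurrent tasks: " + peakConcurrency(cpus, 100));
    }
}
```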

In the real world, of course, tasks are never 100% CPU-bound, so you'll usually want more threads than CPUs to get optimal use of your system. How many more is a function of your workload: how much time it waits for external resources like a database, and so on. But there will be an optimal number, usually considerably less than the number of simultaneous tasks you can handle (particularly if those tasks represent jobs coming in from remote users -- e.g. a web or application server handling thousands of connections). The determining rule is this: if you have more tasks to perform AND you have idle CPU time, then it makes sense to add more threads to the pool. If you have more tasks to perform but no idle CPU time, then it is counter-productive to add threads to the pool. And that's my problem with dynamically resizing threadpools: if they choose to add threads because there are tasks waiting (even though there is no available CPU time), they will hurt your performance rather than help it.

Conceivably, you could use some native code to figure out the idle CPU time on your system and have a threadpool that takes that information into account. That would be better, but even that is insufficient. Say you have an application server accessing a remote database using JPA. Now if the database becomes a bottleneck, you'll have idle CPU time on your application server, and it will have tasks that are waiting. But adding threads to run those tasks will again make things worse: it will increase the work needed to be done by the already-saturated database, and your overall throughput will suffer. In the final analysis, you are the only one that will have all the necessary information to know if it is productive to increase the size of your thread pool.

So you are responsible for setting the maximum size of the threadpool to a reasonable value, so that the system will never attempt to run too many threads at once. Given you've done that, is there a point in having a minimum number of threads? The claim is that there is, because it can save on system resources. But I would argue that the impact of that is really minimal. Each thread has a stack and so consumes a certain amount of memory. But if the thread is idle and the machine doesn't have enough physical memory to handle everything on the system, that idle memory will simply be paged out to virtual memory. Even if the thread exits, the memory it used for its stack still belongs to the JVM process -- the JVM might reuse that memory for something else, but in general, the memory cannot be returned to the operating system for use by other processes. So the memory issue doesn't really have much impact. Depending on the application, it's conceivable that fewer idle threads may have a small impact because when a thread is reused, it might happen to have some important data in the CPU cache (whereas an idle thread selected to run a task won't have any data in the CPU cache), but the effects of that in the real world are pretty much non-existent. So it doesn't hurt to have a minimum number of threads, but you get no real advantage from that either.

One area that can be very subtle in this regard is the ThreadPoolExecutor, which can be configured to have three values: a minimum, a core value, and an absolute maximum. In general, threads are added when tasks are waiting until the system runs the desired core value of threads. Then everything chugs along nicely, even though a certain number of tasks may be waiting in the queue. Now say that the system can't keep up with the task queue: the task queue length grows beyond some defined value. In response to this, the executor will start adding threads (up to the absolute maximum). But if the system is CPU-bound, or if the system is causing a bottleneck on an external resource, adding those threads is exactly the wrong thing to do. And because this happens only under circumstances such as an increased load, it might be something that you fail to catch in normal testing: during normal testing, you'll usually run with the core number of threads and may not even notice that you've misconfigured the maximum number of threads to a value the system cannot handle. The converse of this argument is that the thread pool executor can add new threads when a burst of traffic comes, and as long as there are resources available to execute those threads, the executor can handle the additional tasks (and then, once the burst is over, the extra threads can exit and reduce system resource usage). But given the minimal-at-best effect that has on system resources, handling a burst like that doesn't make a lot of sense to me, particularly given the potential for increasing load on the system at exactly the wrong time.
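
The growth behavior can be observed directly. In java.util.concurrent.ThreadPoolExecutor the knobs are a core size, a maximum size, and a work queue; with a bounded queue, the "defined value" is simply the queue's capacity. The sketch below uses a hypothetical class PoolGrowthDemo and deliberately tiny numbers (core 2, maximum 4, queue capacity 2) chosen only for illustration: under light load the pool stays at its core size, and only once the queue is full do the extra threads appear.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; sizes are illustrative, not recommendations.
class PoolGrowthDemo {
    // Submits `tasks` jobs that all block until released, then reports
    // {current pool size, tasks waiting in the queue}.
    // Keep tasks <= max + queueCapacity, or the default policy rejects the extras.
    static int[] poolAndQueueSize(int core, int max, int queueCapacity, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return new int[] { pool.getPoolSize(), pool.getQueue().size() };
        } finally {
            release.countDown();   // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] normal = poolAndQueueSize(2, 4, 2, 4);  // queue not yet overflowing
        int[] burst = poolAndQueueSize(2, 4, 2, 6);   // queue full, pool grows
        System.out.println("normal load: threads=" + normal[0] + " queued=" + normal[1]);
        System.out.println("burst load:  threads=" + burst[0] + " queued=" + burst[1]);
    }
}
```

A test run that never fills the queue only ever exercises the first case, which is how a badly chosen maximum can go unnoticed.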

All of that is why I always choose to ignore dynamically sizing threadpools, and just configure all my pools with a static size.


How to test container scalability

Posted by sdo on May 02, 2007 at 11:38 AM | Permalink | Comments (10)

Recently, I've been asked a lot about Covalent Technologies' report that Tomcat 6 can scale to 16,000 users and what that means for glassfish. Since glassfish can easily scale to 16,000 users as well (as Covalent found out once they properly configured glassfish), my reply has usually been accompanied by a shrug: we've known for quite some time that NIO scales well.

But what does it mean to scale to N users, where N is large? The answer is highly dependent on your benchmark, and in particular on the think time that your benchmark uses. It's very easy to scale to 16,000 users if they each make a request every 90 seconds: that's on the order of 180 requests/second. On the other hand, if there's no think time in the equation, then continually handling 16,000 requests is quite difficult, particularly on small machines. Closely related to this is the response time of your requests: handling 16,000 requests with an average response time of 10 seconds isn't particularly helpful to your end users. But the most difficult aspect in scaling to 16,000 users is finding sufficient client horsepower to make sure that the clients themselves aren't the bottleneck. Otherwise, any conclusions you draw about the throughput or performance of the server are simply wrong: the conclusions apply to the performance of the clients. So in this blog, I'll explore some of the considerations you need to examine in order to benchmark a large system properly.

I've written before about why the Apache Benchmark can't handle this situation (surprisingly enough, I'd been ranting against ab long before Covalent published their benchmark; it's just fortuitous timing that they brought ab's failings to light at the same time I was fed up with questions about ab benchmarks from my colleagues). So for the tests I'll describe here, I used Faban's new Common Driver. I've also previously written about how Faban is a great, configurable benchmarking framework, but the new common driver is a simple, command-line program that can benchmark requests of a single URL. I ran the tests on a partitioned SunFire T2000. This particular machine has 24 logical CPUs (6 cores with 4 hardware threads each, but for our purposes, simply 24 CPUs), which I partitioned into a server set of 4 CPUs and a client set of 20 CPUs. Yes, it takes 20 CPUs to drive some of the tests I ran, and so for consistency, I kept that configuration for all of them. But it's a crucial point: if the client is a bottleneck, you're measuring the client performance, not the server performance. Using a set of processors on a single machine allowed me to run the tests bypassing the network, which also removes a potential bottleneck from measuring the server performance. Given that there are only 4 CPUs for the server, I configured all containers to use 2 acceptor threads and 20 worker threads, and otherwise followed Sun's and Covalent's blog entries on configuring the containers.

I started with a simple test:
java -d64 -classpath $JAVA_HOME/lib/tools.jar:fabancommon.jar:fabandriver.jar \
   -Xmx3500m -Xms3500m com.sun.faban.driver.cd -c 30 http://localhost/tomcat.gif

This runs 30 separate clients (each in its own thread), each of which continually requests tomcat.gif with no think time. You'll notice we're using a 64-bit JVM for the test; eventually we'll be creating 16000 threads, which will require more than 4GB of address space. So to make it easier for me, I used that JVM for all my tests. Have I mentioned that driving a big client load requires a lot of resources so that the client doesn't become the bottleneck?

The common driver reports three pieces of information: the number of requests served per second, the average response time per request, and the 90th percentile for requests: 90% of requests were served with that particular response time or less. It will also report the number of errors observed and some error conditions I'll discuss a little later. I varied this test for different numbers of clients to see these results:

# Users        Glassfish       Tomcat
  30          7552.9/0.004    7614.6/0.003
 100         10004.6/0.009    7680.4/0.013
1000         12434.7/0.079    6880.3/0.145
5000          8942.7/0.534    7589.0/0.654
The results here are operations per second and the average response time. I'd assume that I've misconfigured Tomcat's file cache here, but the point isn't to make a comparison between the products' absolute performance; rather it is to explore issues around scalable benchmarking. For static content, we get decent scaling, though at some point there are enough requests that the throughput of the server suffers: just what we would expect. So what about a dynamic test? Here are some numbers from surfing to http://localhost/Ping/PingServlet -- which is just a simple servlet that prints out 4 html strings and returns.
# Users        Glassfish       Tomcat
   30          5033.3/0.005   7154.0/0.004
  100          6359.5/0.015   7459.5/0.013
 1000          7411.2/0.134   6483.2/0.154
 5000          6060.1/0.818   6976.5/0.712
16000          6144.3/2.544   5263.0/2.375
Here the numbers are fairly close. At the low end, glassfish pays a penalty for being a full Java EE container, which requires it to do some additional work for the simple servlet. [Though the fact that the glassfish ops/sec increases so much with more users is an indication that there's probably some bottleneck we could fix in the code at 30 users; hmm...a performance engineer's work is never done.] That result at 5000 users? I'll discuss it later, but it's an anomaly. But first: what about 16,000 connections? In addition to producing low throughput, the tomcat run also reported:
ERROR: Little's law failed verification: 16000 users requested; 13092.3455
users simulated.
In essence: almost 20% of the connections weren't serviced as expected (glassfish reported a similar error). I could repeat the test, and sometimes it would pass; sometimes it would fail. But I'm clearly at the limit here of the hardware and software. In this scenario, most of the errors are timeout errors on connection: the server is too saturated in this test to accept new connections. Note that that wouldn't happen with something like ab, because ab's single-threaded nature inherently introduces an arbitrary (and unmeasured) amount of think time into the equation. The amount of think time is crucial, in that it drastically reduces the load on the server; and an arbitrary amount of think time is fatal, because we no longer know what we're measuring.

To test this scenario properly, we introduce a deterministic think time into the driver by including a -W 2000 parameter, which says each client should have a 2 second (2000 ms) think time between requests. Now for 16,000 users, each server gave me these results:
                    Glassfish       Tomcat
ops/second           6988.9          6615.3
Avg. resp time        0.242           0.358
Max resp time         1.519           3.693
90% resp time           0.6            0.75
Now both containers are handling the 16000 users, and the data we get regarding throughput and response time is valid.
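
One way to sanity-check numbers like these is the textbook form of Little's law for a closed system: the number of active users equals throughput multiplied by the sum of response time and think time. The sketch below applies that relation to the figures in the table above. The class and method names are hypothetical, and how Faban performs its own verification internally is not something this example claims to reproduce.

```java
// Hypothetical helper; applies the textbook relation, not Faban's actual check.
class LittlesLaw {
    // users = throughput * (average response time + think time)
    static double impliedUsers(double opsPerSecond, double avgResponseSeconds, double thinkSeconds) {
        return opsPerSecond * (avgResponseSeconds + thinkSeconds);
    }

    public static void main(String[] args) {
        // Figures from the 16,000-user runs with a 2-second think time.
        System.out.printf("Glassfish: %.0f users%n", impliedUsers(6988.9, 0.242, 2.0));
        System.out.printf("Tomcat:    %.0f users%n", impliedUsers(6615.3, 0.358, 2.0));
    }
}
```

Both values come out within a few percent of the 16,000 users requested, which is what you would hope to see when the driver is actually simulating the load it was asked for.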

Back to that result at 5000 users. The other interesting output from the Faban common driver for the glassfish result was:
ERROR: Think time deviation too high; request 0; actual is 1.0

Or in the case of tomcat, the actual was 6.0 (accounting for their better score) -- but the point is, although we didn't want think time on the client, the client had some bottleneck that didn't allow it to keep up and hence the benchmark result suffered. In effect, we ended up benchmarking the client again, having yet again introduced an arbitrary, non-deterministic think time. So even for 5000 users, we need to use some think time to get an accurate assessment of the server behavior. And so here are the results at 5000 users with a 500 millisecond think time:

                     Glassfish       Tomcat
ops/sec               7607.25         7224.1
Avg. resp time          0.149          0.182
Max resp time           0.737          2.626
90% resp time            0.25           0.25
So does any of this mean that glassfish is better than tomcat? For some applications, probably. For others, probably not. The real point to take away from this is an understanding of how important it is to understand what you're measuring when you measure performance. The tests I've run are much too simple to draw any conclusions from: the only realistic benchmark is your own application. But hopefully, now you have a better understanding of how to approach large-scale testing of your own application.

The Common Driver for Faban is brand new code, so it hasn't yet been integrated into Faban's build schedule -- in fact, there is an issue with how it handles POST requests, which is what is delaying its integration. For now, you can download the fabancommon.jar and fabandriver.jar files I used for testing. If you find any problems with it (other than trying a POST request), be sure to let me know!

ab considered harmful

Posted by sdo on March 23, 2007 at 03:09 PM | Permalink | Comments (2)

For the fifth time this year, I've been contacted by a distraught user claiming that glassfish doesn't scale or run well based on results seen from ab (the Apache Benchmark). And so again, I've had to explain why ab is a terrible tool to use to measure the performance of your application (or web) server.

To be fair, glassfish does have some out-of-the-box settings that make its benchmark test results less than ideal. Jeanfrancois has this excellent blog that describes the basic settings you need to change before even beginning to do serious performance analysis. I'm hopeful that we'll have better profiles by the time FCS rolls around so that a performance-based profile is easily available to end users. [There are some conflicts between optimal settings for developers and production, which is one cause of our problem here, not to mention some historical baggage we have for backward-compatibility. But that's a topic for another day.]

But once you have a reasonably configured appserver, ab is still not the best tool to use to measure your performance. The biggest problem is that ab is a single-threaded process, and you're typically interested in measuring the performance of your multi-CPU machine running the multi-threaded appserver. You can (I hope) see the inherent problem: you have 1 CPU of client-side resources and, say, 4 CPUs of server-side resources. Which side will become the bottleneck first? The client side -- meaning all you've accomplished is measuring the performance of ab itself.

This all depends on what you're measuring, of course. Lately, using ab to measure the retrieval of a single static image seems to be all the rage, and this is the worst possible test. Let's say that it takes the appserver 50% longer to process the request for http://host/foo.gif than it takes for ab to send the request and parse the response to make sure it came back correctly (and drain the socket of all the data). Even that is unrealistic, but what it means is that you'll end up using 1.5 CPUs on your appserver by the time your client gets saturated. Nothing you do to the appserver will make this better; the bottleneck is ab.

So now you're thinking: what if I have multiple CPUs on my client and I use that -c option to ab: the option that's supposed to send "concurrent" requests. Won't that scale? Unfortunately not, because the "concurrent" requests are still processed sequentially by ab. ab has only a single thread available to it, so all it does is send multiple requests (one after the other), read any responses that have been sent back (still only one at a time), send any new requests, and so on. It is still limited to utilizing at most a single CPU.

And what of the timings you get out of this? The single ab thread sends a request at time 0. Then if it has other responses to process, it will do so. Say there are 10 more responses to process (which means draining the socket of data, and sending the next request on the socket), and then say ab takes 10 milliseconds for each request. Only then will it again look for a response to the original request. If the response to the original request is waiting for ab, ab will report that it took 110 milliseconds for that request to be processed. But that's only because ab itself spent 100 milliseconds handling other details; it has erroneously charged all of that time it spends sequentially processing data to the pending response. Client-side overhead in any load-generating tool is a problem, but the sequential design of ab makes the problem much worse in ab than in other load generators.
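
The arithmetic can be written down as a deliberately simplified model of a single-threaded client. Everything in this sketch is an assumption made for illustration: the class and method names are invented, the 5 millisecond server time is hypothetical, and the model ignores everything except the client's own sequential handling cost. It is not a description of ab's actual source code.

```java
// Hypothetical, simplified model of a sequential load generator's accounting.
class SequentialClientTiming {
    // Latency the client would report for one request when it must first
    // handle otherResponses responses at handlingMs each before it gets
    // around to reading (and handling) the response it is timing.
    static long reportedLatencyMs(long serverMs, int otherResponses, long handlingMs) {
        long busyUntil = otherResponses * handlingMs;   // client occupied with other sockets
        long noticedAt = Math.max(serverMs, busyUntil); // response must exist and client must be free
        return noticedAt + handlingMs;                  // plus handling this response itself
    }

    public static void main(String[] args) {
        // Server answers in 5 ms (made-up figure), but 10 other responses are ahead in line.
        System.out.println("busy client reports: " + reportedLatencyMs(5, 10, 10) + " ms");
        // Same server, nothing else for the client to do.
        System.out.println("idle client reports: " + reportedLatencyMs(5, 0, 10) + " ms");
    }
}
```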

Finally, what about those responses? If you run ab -c 100, there are 100 channels open to the server, and ab will report how much throughput comes through those 100 channels. But it won't tell you anything about fairness: 100 responses could come from one channel, or 1 response could come from each channel, and ab will give you the same answer. In fact, given its sequential design, an application server that responds unfairly to requests will show better response times in ab than an application server that responds to requests fairly. But somehow, I don't think the actual users of the first application server will be all too happy (well, one of them will be quite happy indeed!).
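
A tiny sketch shows why an aggregate number hides this. The class FairnessCheck and its two helper methods are hypothetical; the two arrays stand for the number of responses seen on each of 100 channels in the two extreme cases described above.

```java
import java.util.Arrays;

// Hypothetical helper for comparing aggregate throughput with per-channel counts.
class FairnessCheck {
    // Total responses across all channels: the only figure an aggregate report shows.
    static int total(int[] perChannel) {
        return Arrays.stream(perChannel).sum();
    }

    // Responses seen by the worst-served channel.
    static int fewest(int[] perChannel) {
        return Arrays.stream(perChannel).min().orElse(0);
    }

    public static void main(String[] args) {
        int[] fair = new int[100];
        Arrays.fill(fair, 1);      // 1 response on each of 100 channels
        int[] unfair = new int[100];
        unfair[0] = 100;           // all 100 responses on a single channel

        System.out.println("fair:   total=" + total(fair) + " fewest=" + fewest(fair));
        System.out.println("unfair: total=" + total(unfair) + " fewest=" + fewest(unfair));
    }
}
```

The totals are identical; only a per-channel figure such as the minimum reveals that 99 of the users in the second case got nothing.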

Are there alternatives to ab? I'm quite happy with faban, an open-source benchmarking toolkit developed by some of my colleagues. It is multi-threaded, can access arbitrary URLs, and measures fairness among other things. It is trickier to set up than ab, though in a future blog I'll explore how it can be used as an ab alternative. Until then, if someone offers you ab, just say no.


New SPECjAppServer scores for Glassfish

Posted by sdo on December 14, 2006 at 03:47 PM | Permalink | Comments (5)

Today, Sun releases version 9.0 Update Release 1 Patch 1 of its application server (quite a mouthful!). See what's new in this release.

From my perspective, its most important fix is to a bug that caused the SPEC organization to mark our previously submitted SPECjAppServer scores as non-compliant. This allowed us to resubmit results on this benchmark.

Accordingly, a few weeks ago we resubmitted results on the Sun Fire T2000 Cool Threads server. This is clearly a machine of choice for Java EE applications, as almost every application server vendor who has submitted results for this benchmark has done so on that hardware.

This result is also the first benchmark in the industry to use the new, just-released, Java 6 JDK, which has some pretty impressive performance results to boast about as well.

The results show that glassfish is still clearly the price-performance leader for the application server tier.

Application Server Vendor              SPECjAppServer2004 JOPS@Standard    Application Tier $/SPECjAppServer2004 JOPS@Standard
Sun Java Systems AS 9.0 UR1 Patch 1    521.42                              $51.81
BEA Weblogic 9.0                       615.64                              ??
IBM WebSphere 6.1                      616.22                              $122.13

Notes on pricing data: Pricing is calculated on the acquisition cost for the application server tier hardware and software. Pricing for the application server hardware is from http://www.sun.com/. Pricing for IBM WebSphere is from http://www.cdw.com/. All pricing is list pricing as of 12/14/06.

I'd really like to include BEA's cost in the above table, but BEA doesn't have transparent pricing. http://www.awaretechnologies.com/ has a price for a single core license of BEA 8.1 Advantage Edition at $10K. At that price, BEA would come in at $60.13/JOP. But of course, the T2000 has 8 cores, and BEA list price on that machine is likely closer to $20K, leaving them at $76.37/JOP. Alas, to be truly accurate, you'll have to get your BEA rep to give you their list price, and do the math yourself: ($26917 (HW price including OS Media) + BEA list Price) / 615.64.

More disclosures: SPEC and the benchmark name SPECjAppServer 2004 are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of 12/14/06. The comparison presented is based on application servers run on the Sun Fire T2000 1.2 GHz server. For the latest SPECjAppServer 2004 benchmark results, visit http://www.spec.org/.

Sun Posts SPECjAppServer2004 results using Glassfish

Posted by sdo on May 25, 2006 at 11:44 AM | Permalink | Comments (7)

Today, Sun posted our first-ever SPECjAppServer 2004 result on SJSAS 9.0 Platform Edition. This is the only SPECjAppServer result published so far on an open-source application server -- and the result used an open-source Operating System (Solaris 10) and open-source database (MySQL) as well. It is also the first (and so far only) SPECjAppServer result published on an application server that is certified to the Java EE 5 specification.

Sun posted a result of 712.87 SPECjAppserver 2004 JOPS@Standard on a configuration of 3 Sun Fire X4100 application servers and 1 Sun Fire X4100 database. Direct comparison to our previous result on 3 application servers (Sun Fire v20z machines) is a little tricky: the newer machines are dual-core and have a slightly faster clock speed, so you'd expect this newer configuration to have slightly better than 100% performance of the old configuration. Yet our result shows a 167% improvement over that previous submission -- a substantial improvement in the software layer no matter how you look at it.

Congratulations to everyone who worked on glassfish: the fact that we were able to get such an improvement while at the same time dealing with the aggressive schedule to support Java EE 5 and the new scenario of working in the open source community is a great achievement!

Benchmark Description
SPECjAppServer 2004 is a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers. Moreover, SPECjAppServer2004 also heavily exercises all parts of the underlying infrastructure that make up the application environment, including hardware, JVM software, database software, JDBC drivers, and the system network. The primary metric of the SPECjAppServer2004 benchmark is jAppServer Operations Per Second ("SPECjAppServer2004 JOPS") in either @Standard or @Distributed mode.

Required Disclosure Statement:
SPECjAppServer2004 3x Sun Fire X4100 appservers (12 cores, 6 chips) and 1 Sun Fire X4100 database (4 cores, 2 chips) 712.87 SPECjAppServer2004 JOPS@Standard. SPECjAppServer 2004 3x Sun Fire V20z appservers (6 cores, 6 chips) and 1 Sun Fire V20z database (2 cores, 2 chips) 266.01 SPECjAppServer2004 JOPS@Standard. All results from www.spec.org as of 05/25/06. SPEC, SPECjAppServer reg tm of Standard Performance Evaluation Corporation

The NetBeans profiler -- change is good

Posted by sdo on February 13, 2006 at 01:38 PM | Permalink | Comments (2)

I am a creature of habit. At some level, I understand that a syntax-directed powerful editing tool might make me more productive. But vi has been good enough for me for the past 25 years; it will be good enough for the next 25. This is pretty much emblematic of my (problematic) approach to technology.

Why, then, have I recently switched to using the NetBeans profiler? Yes, it's free and powerful -- but that hasn't gotten me to switch away from vi and javac. And I'm a long-time user of OptimizeIt when I need to profile Java applications on Windows machines. Granted, my use of OptimizeIt is limited to Windows machines; for profiling on Sparc and Linux, I much prefer Sun's Studio 11 collector and analyzer; the fact that those tools are written in C means they can give me insight into the JVM that no Java-based profiler can. But still: what overcame my usual inertia and got me to move to the NetBeans profiler?

Quite simply, it works when other tools don't. For quite some time, the glassfish performance team has faced a regression in deployment times that manifests itself only on Windows. Months of analysis with OptimizeIt failed to make any progress. Fifteen minutes with the NetBeans profiler, and we had a fix.

Well, twenty minutes. One caveat about using the NetBeans profiler with glassfish: glassfish doesn't like the default installation path of Netbeans (or any path that contains a space in it). When it comes time to configure glassfish, if the path to the Netbeans profiler has a space, you'll face some difficulty. So my first recommendation: grab the Netbeans platform and profiler, and install it in a non-standard location (e.g., C:\NETBEANS).

Once it's installed, make sure to add the Netbeans profiler directory to your PATH variable (e.g. C:\NETBEANS\profiler\lib\deployed\jdk15\windows). Then fire up Netbeans, and click on the Profile menu and select the "Attach profiler" menu item. This displays a window, and the first thing you should notice in that window is that it contains an "Attach Wizard" button. That's right: to configure glassfish (and many other applications) to work with the Netbeans profiler, all you need to do is proceed through a simple GUI. No dealing with difficult shell scripts to start your appserver. [Hey, these productivity features might be worth it after all.]

I won't go into the details of the attach GUI, other than to point out that because glassfish is not yet released, Netbeans doesn't have a choice for that: you must select the Sun Java Systems Application Server 8.1. Because that is configured the same way as glassfish, you're all set. [If by chance you installed NetBeans in the default location, pay attention to the configuration info that the wizard tells you it added to the appserver's domain.xml file: you must edit the paths in that file to conform to that really intuitive Windows file naming convention that uses lots of UPPER~ names.]

Ease of use is a great thing, but that's not good enough for me. What did the tool show us that convinced me to switch? We started the appserver in profiling mode. After it started up, we reset the collected results in the profiling window (hmm, another cool feature), deployed our application, and took the following snapshot:
Like all Java-based profilers, the Netbeans profiler is somewhat confused about methods that block on I/O, so the poll method looks to be the most time-consuming. However, we know that can be discounted: the method has blocked in the kernel so isn't taking any CPU time. But the ZipFile.access$700 method is certainly suspicious. Indeed, repeating our experiment on the Sun Java Systems Application Server 8.2 (from which we were measuring the regression) makes it that much clearer: in the 8.2 profile, the access$700 method is called only some 40K times and requires only about 4999 ms of CPU time.

Looking at the backtraces of that method in glassfish, we see that it is called via two paths:
One path is clearly the compiler and is called 47K times; the other path turns out (when you expand the nextElement$1 node) to be from the creation of the EJB classloader associated with the application. In 8.2, however, the access$700 method is called only from the compiler; it is never called from the EJB classloader. A few minutes of detective work and we found that a change in the code creating the EJBClassloader meant that system jar files were incorrectly being added to its search path, and reading those jar files led to all the new calls to the access$700 method.

We were searching for a regression, but note how easy this still would have been if our goal had simply been to make deployment faster: we would have immediately started to look for ways to call the access$700 method less often and been led to the same classloader code. With minimal effort, we know that to make deployment faster, we need to optimize the way in which the EJB classloader is created. [Strictly speaking, the conclusion we draw is that we are even better off making the compiler optimize the way in which it calls the access$700 method. But the compiler comes to us from another team; we can (and have) asked that team to optimize the code, but it's not something we can affect directly.]

Our OptimizeIt profiles never showed us this issue. They did point to the access$700 method as a possible hotspot (one of about eight!), but the call stack shown by OptimizeIt for that method shows only the compiler path: it was impossible to detect from the OptimizeIt profile that the access$700 method was ever called by anything but the javac compiler. A lengthy investigation (before the more accurate information from NetBeans was available) led us to prove that the javac invocation was identical and taking an identical amount of time; an unfortunate blind alley. And then the OptimizeIt profile led us completely astray to examine the other hotspots in the profile.

Had OptimizeIt solved my problem in the beginning, I might never have gotten around to trying the NetBeans profiler, despite entreaties from my friends and colleagues at NetBeans. Since Netbeans actually solved problems for me that nothing else could, I'm now its biggest fan. Now if it only had a vi keybinding mode for its editor...

Don't guess -- test

Posted by sdo on December 09, 2005 at 09:28 AM | Permalink | Comments (3)

One of the things that always interests me is the relative performance of the collection classes. Recently, I discovered a particular anomaly of the ConcurrentHashMap class.

I've always considered the ConcurrentHashMap class as something to be used in special cases: use a Hashtable, and if you notice a lot of contention for your hashtable, then switch to a ConcurrentHashMap. Of course, you always write your code in terms of the Map interface so that such a switch will be trivial, right?
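To make the "trivial switch" concrete, here is a minimal sketch of what coding against the Map interface might look like. The SessionAttributes class and its method names are hypothetical, invented for illustration; only Map, Hashtable, and ConcurrentHashMap are real JDK types.

```java
import java.util.Hashtable;
import java.util.Map;

// Hypothetical holder for per-session data; not taken from any real codebase.
class SessionAttributes {
    // The only line that names a concrete class. Switching to
    // new java.util.concurrent.ConcurrentHashMap<>() is a one-line change,
    // because everything else below depends only on the Map interface.
    private final Map<String, Object> attributes = new Hashtable<>();

    void put(String name, Object value) {
        attributes.put(name, value);
    }

    Object get(String name) {
        return attributes.get(name);
    }

    int size() {
        return attributes.size();
    }
}
```

Both Hashtable and ConcurrentHashMap reject null keys and null values, so swapping one for the other does not change that part of the behavior.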

This conviction stems partly from habit, partly from the fact that I strongly believe that simple code is faster (the Hashtable class is a much simpler implementation), and partly from some microbenchmarks I've run showing that when there is little or no contention, Hashtable is a faster implementation of the Map interface than ConcurrentHashMap. This is particularly true on recent VMs, which do a much better job at uncontended lock acquisition. [On the other hand, the ConcurrentHashMap greatly increases throughput when there is moderate to severe contention for the map.]

Recently I ran across some newly written code that used ConcurrentHashMap in its initial implementation. It unit tested fine, of course, and we ran some simple performance tests on it, and it was still fine. And then we ran into an interesting test case, where we created thousands of the ConcurrentHashMap objects at a time (each one embedded in an Http session object).

It turns out that the size of an empty ConcurrentHashMap object is 1272 bytes; an empty Hashtable object is just 96 bytes. So forget any minor performance difference in storage and retrieval that might exist between the two; in this case, our GC times when using the ConcurrentHashMap dominated everything else. A simple one-line change in the code, and we were back in business.
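If you want to check the footprint on your own VM, a rough sketch along these lines may help. It is not the method used to get the figures above; it simply allocates many empty maps and divides the change in used heap by the count. The EmptyMapFootprint class and its method names are made up for this example, System.gc() is only a hint to the VM, and the result is a noisy estimate. The 1272 and 96 byte figures came from a 2005-era VM, and later JDK releases changed these implementations, so expect different numbers.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical helper: a crude heap-delta estimate, not a precise object sizer.
class EmptyMapFootprint {

    // Used heap after asking for a collection (the request may be ignored).
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Approximate bytes retained per empty map created by the factory.
    static double bytesPerInstance(Supplier<Map<String, Object>> factory, int count) {
        List<Map<String, Object>> keep = new ArrayList<>(count);
        long before = usedHeap();
        for (int i = 0; i < count; i++) {
            keep.add(factory.get());
        }
        long after = usedHeap();
        // Touch the list after measuring so the maps stay reachable until here.
        if (keep.size() != count) {
            throw new IllegalStateException("unexpected list size");
        }
        return (after - before) / (double) count;
    }

    public static void main(String[] args) {
        int count = 100_000;
        System.out.printf("Hashtable:         ~%.0f bytes each%n",
                bytesPerInstance(Hashtable::new, count));
        System.out.printf("ConcurrentHashMap: ~%.0f bytes each%n",
                bytesPerInstance(ConcurrentHashMap::new, count));
    }
}
```

Run it a few times; if the numbers jump around, a larger count or a fixed heap size usually steadies them.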

Will you see this type of thing in your app? Maybe not -- it is admittedly an unusual use case of the collection classes. But I like this example, since it reinforces my basic programming principles: start by using the simpler code, be prepared for changes, and don't expect that you'll understand the performance of your application until you test it under a variety of circumstances.

Java EE Performance at JavaOne

Posted by sdo on November 11, 2005 at 12:57 PM | Permalink | Comments (1)

I was a little surprised to find the JavaOne 2006 Call For Papers in my email this week; wasn't JavaOne 2005 just last month? It can't be mid-November; it's been 60 degrees for weeks in New York.

If you're interested in presenting anything related to Java EE performance at JavaOne, I encourage you to submit an abstract. We did not have the largest selection of such talks last year, and I'd like to see a lot of performance talks this year.

By the same token, if there are performance-related topics you'd like to hear about at JavaOne, let us know.

I'm vacillating about what I'd like to talk about this year. On the one hand, I'd love people to hear about EJB 3.0 performance and enhancements we've done in grizzly. On the other hand, I've spent so much time this week dispelling performance myths and half-truths that I'm thinking a basic talk about performance may be what's called for.

I remember spending a lot of time 9 years ago dispelling myths about Java performance; in those days, parts of Java were indeed slow. But many other things contributed to performance as well; I remember the article by one Microsoft marketing person talking about an applet he was running, saying that the next step was to wait...and wait...and wait -- because, you see, Java is slow. Of course, he was waiting because he was downloading code over his 14K modem; his issues had nothing to do with Java's performance (even if it wasn't stellar at the time). But that was nine years ago, I remind myself.

So it was depressing to me this week to run into three instances of this sort of thing; apparently we haven't made that much progress in understanding performance. Two of these cases were by developers who ought to have learned better by now (and to be fair, they were willing to), and the third was yet more misanalysis of performance by BEA about SPECjAppServer scores.

The BEA case is of course all about marketing, but it's still depressing to see such misanalysis. In particular, BEA rightly argues that you can't just look at total JOPS and tell anything about a SPECjAppServer 2004 submission, and they further posit that what's important is determining the software and hardware required for your requirements. Exactly so.

Why, then, does BEA next show a calculation of $Hardware/Operations? Didn't they just say that software was an equally important member of the equation? Did they leave out software $ because they didn't want to draw attention to their licensing costs and change the equation out of their favor?

Then, after saying that what's important is $Hardware/Operations, BEA performs a completely different calculation of #CPUS/Operations, as if all CPUs and all systems cost the same amount of money.

Mind-boggling performance analyses like this make me feel that at a fundamental level, performance is still a misunderstood quantity, and rather than talking about the progress we've made, it's time to step back and (re-)learn some fundamentals.

Understanding Performance

Posted by sdo on September 30, 2005 at 12:05 PM | Permalink | Comments (1)

For the last few years, I've worked in the Java Performance Group at Sun Microsystems. So I thought it might be good to begin my first blog entry by talking about what's important in looking at performance.

I'm prompted to look into this topic because of a recent blog by Eric Stahl, who discusses the performance of SPECjAppServer 2004. The thing about application server benchmarks -- and especially the ones from SPEC -- is that they are trivial to scale horizontally. It's not enough just to look and see which application server has the highest score, because any vendor at any time can put together a larger configuration of machines and get the new world's record. [There's a slight complication here, in that the benchmark is a system benchmark, not an application server benchmark, and hence is subject to pressure from a back-end database. That's an important topic for another day, but it doesn't affect today's point.]

Interestingly, Eric understands this at some level, because he concludes by calculating how many transactions each application server can get on their respective CPUs. When you know that the total score doesn't mean anything, it's natural to attempt to normalize the numbers from disparate software and hardware combinations. And of course, this particular normalization allows him to show Sun's score in a bad light and conclude "this is a perfect example of the potential for free software, such as Sun's app server or JBoss, to drive costs up by needing significantly more hardware." Right instinct; wrong analysis.

It's the right instinct because performance isn't who has the highest benchmark number. Performance is who performs the work "best" -- where it's up to you to define best. Maybe someone actually would define best as "most transactions per CPU" (even if, as in this case, the CPUs aren't equivalent, which makes the calculation that much more irrelevant). But perhaps you'd like to define best as "most cost-effective overall." I'd argue that definition has more merit.

Let's look at the acquisition costs of these results. By my calculations, BEA's application server costs around $120K to produce 1664 JOPS on machines that cost around $8K. Sun's application server is free but requires more ~$4K machines. So that translates to roughly $100 per transaction for BEA and about $53 for Sun. Of course, you may argue that I'm being overly simplistic; there are additional costs like support costs, and database costs, and so on. Some of those might be important to your decision making and some may not -- in particular, the backend database is going to support a certain number of transactions regardless of the front end, so leaving that out is a way to concentrate only on the relative merits of the appserver tier (plus, database hardware and software pricing makes price comparisons between disparate systems much less interesting).
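The arithmetic behind a figure like this is simple enough to sketch. The class and method names below are invented for illustration, and the machine count is an assumption: the post gives BEA's software cost (about $120K), machine price (about $8K), and score (1664 JOPS), but not how many machines were used. Six is chosen here only because it lands near the "roughly $100" figure; substitute the counts from the actual Bill of Materials.

```java
// Hypothetical helper for appserver-tier acquisition cost per JOPS.
class AcquisitionCost {

    // (software + machines * price per machine) / JOPS
    static double costPerJops(double softwareCost, int machineCount,
                              double costPerMachine, double jops) {
        if (jops <= 0) {
            throw new IllegalArgumentException("jops must be positive");
        }
        return (softwareCost + machineCount * costPerMachine) / jops;
    }

    public static void main(String[] args) {
        int assumedMachines = 6; // ASSUMPTION: not stated in the post

        // Software and hardware together, as in the comparison above.
        double full = costPerJops(120_000, assumedMachines, 8_000, 1664);
        // Hardware only, which is what a $Hardware/Operations figure reports.
        double hardwareOnly = costPerJops(0, assumedMachines, 8_000, 1664);

        System.out.printf("software + hardware: $%.2f per JOPS%n", full);
        System.out.printf("hardware only:       $%.2f per JOPS%n", hardwareOnly);
    }
}
```

Under that assumed machine count, the combined figure comes out at about $101 per JOPS and the hardware-only figure at about $29, which shows how much the answer depends on which costs you decide to include.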

SPEC makes it theoretically possible (though tedious) to figure this out for yourself; all submissions include a full Bill of Materials from which you can figure out the total cost of the submission (assuming vendors or their resellers have the relevant prices on their websites), or just the software, or just what's needed to run the appservers without the database, with or without support costs, or whatever parts you want to include or isolate.

And that, of course, is the point: it's up to you to determine what's important when you make a software/hardware decision. Just don't be swayed by incomplete arguments that free software is going to cost you more in the end.


