Scott Oaks's Blog

How to test container scalability

Posted by sdo on May 02, 2007 at 11:38 AM

Recently, I've been asked a lot about Covalent Technologies' report that Tomcat 6 can scale to 16,000 users and what that means for glassfish. Since glassfish can easily scale to 16,000 users as well (as Covalent found out once they properly configured glassfish), my reply has usually been accompanied by a shrug: we've known for quite some time that NIO scales well.

But what does it mean to scale to N users, where N is large? The answer is highly dependent on your benchmark, and in particular on the think time that your benchmark uses. It's very easy to scale to 16,000 users if they each make a request every 90 seconds: that's on the order of 180 requests/second. On the other hand, if there's no think time in the equation, then continually handling 16,000 requests is quite difficult, particularly on small machines. Closely related to this is the response time of your requests: handling 16,000 requests with an average response time of 10 seconds isn't particularly helpful to your end users.

But the most difficult aspect of scaling to 16,000 users is finding sufficient client horsepower to make sure that the clients themselves aren't the bottleneck. Otherwise, any conclusions you draw about the throughput or performance of the server are simply wrong: the conclusions apply to the performance of the clients.

So in this blog, I'll explore some of the considerations you need to examine in order to benchmark a large system properly. I've written before about why the Apache Benchmark can't handle this situation (surprisingly enough, I'd been ranting against ab long before Covalent published their benchmark; it's just fortuitous timing that they brought ab's failings to light at the same time I was fed up with questions about ab benchmarks from my colleagues). So for the tests I'll describe here, I used Faban's new Common Driver.
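To put a number on the think-time arithmetic earlier in this entry (16,000 users, one request every 90 seconds), here is a back-of-the-envelope sketch. It assumes response time is negligible next to think time, so each user issues roughly one request per think interval; the function name is just an illustration, not part of Faban or any other tool.

```python
def approx_request_rate(users: int, think_time_s: float) -> float:
    """Rough requests/second generated by a fixed population of users.

    Assumes response time is negligible compared with think time,
    so each user completes about one request per think interval.
    """
    if think_time_s <= 0:
        raise ValueError("think time must be positive for this estimate")
    return users / think_time_s


# 16,000 users, each making one request every 90 seconds
print(round(approx_request_rate(16000, 90.0), 1))  # 177.8
```

With no think time at all, this estimate no longer applies: the request rate is then limited only by how fast the server (or the client) can turn requests around.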
I've also previously written about how Faban is a great, configurable benchmarking framework, but the new common driver is a simple, command-line program that can benchmark requests of a single URL.

I ran the tests on a partitioned SunFire T2000. This particular machine has 24 logical CPUs (6 cores with 4 hardware threads each, but for our purposes, simply 24 CPUs), which I partitioned into a server set of 4 CPUs and a client set of 20 CPUs. Yes, it takes 20 CPUs to drive some of the tests I ran, and so for consistency, I kept that configuration for all of them. But it's a crucial point: if the client is a bottleneck, you're measuring the client performance, not the server performance. Using a set of processors on a single machine allowed me to run the tests bypassing the network, which also removes a potential bottleneck from measuring the server performance. Given that there are only 4 CPUs for the server, I configured all containers to use 2 acceptor threads and 20 worker threads, and otherwise followed Sun's and Covalent's blog entries on configuring the containers.

I started with a simple test:

    java -d64 -classpath $JAVA_HOME/lib/tools.jar:fabancommon.jar:fabandriver.jar \
         -Xmx3500m -Xms3500m com.sun.faban.driver.cd -c 30 http://localhost/tomcat.gif

This runs 30 separate clients (each in its own thread), each of which continually requests tomcat.gif with no think time. You'll notice we're using a 64-bit JVM for the test; eventually we'll be creating 16,000 threads, which will require more than 4GB of address space. So to make it easier for me, I used that JVM for all my tests. Have I mentioned that driving a big client load requires a lot of resources so that the client doesn't become the bottleneck?

The common driver reports three pieces of information: the number of requests served per second, the average response time per request, and the 90th percentile for requests: 90% of requests were served with that particular response time or less. It will also report the number of errors observed and some error conditions I'll discuss a little later.
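As a rough illustration of those three figures, the sketch below derives them from a list of per-request response times. This is not Faban's code: the function name and sample data are made up, and the nearest-rank percentile used here is only one of several common definitions, so the driver's own number may be computed differently.

```python
import math


def summarize(response_times_s, duration_s):
    """Return (requests/sec, mean response time, 90th-percentile response time)."""
    if not response_times_s or duration_s <= 0:
        raise ValueError("need at least one sample and a positive duration")
    ordered = sorted(response_times_s)
    count = len(ordered)
    throughput = count / duration_s
    mean = sum(ordered) / count
    # Nearest-rank percentile: the smallest sample such that at least
    # 90% of all samples are less than or equal to it.
    rank = math.ceil(9 * count / 10)
    p90 = ordered[rank - 1]
    return throughput, mean, p90


# Ten hypothetical response times (seconds) collected over a 2-second run
samples = [0.002, 0.003, 0.003, 0.004, 0.004, 0.004, 0.005, 0.005, 0.006, 0.020]
ops, mean, p90 = summarize(samples, duration_s=2.0)
print(f"{ops:.1f} ops/sec, mean {mean:.4f} s, 90th percentile {p90:.3f} s")
```

Note how the single slow request (0.020 s) pulls the mean up while leaving the 90th percentile untouched, which is why it helps to report both.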
I varied this test for different numbers of clients to see these results:

    # Users    Glassfish        Tomcat
    30         7552.9/0.004     7614.6/0.003
    100        10004.6/0.009    7680.4/0.013
    1000       12434.7/0.079    6880.3/0.145
    5000       8942.7/0.534     7589.0/0.654

The results here are operations per second and the average response time (in seconds). I'd assume that I've misconfigured Tomcat's file cache here, but the point isn't to make a comparison between the products' absolute performance; rather it is to explore issues around scalable benchmarking. For static content, we get decent scaling, though at some point there are enough requests that the throughput of the server suffers: just what we would expect.

So what about a dynamic test? Here are some numbers from surfing to http://localhost/Ping/PingServlet -- which is just a simple servlet that prints out 4 html strings and returns.

    # Users    Glassfish        Tomcat
    30         5033.3/0.005     7154.0/0.004
    100        6359.5/0.015     7459.5/0.013
    1000       7411.2/0.134     6483.2/0.154
    5000       6060.1/0.818     6976.5/0.712
    16000      6144.3/2.544     5263.0/2.375

Here the numbers are fairly close. At the low end, glassfish pays a penalty for being a full Java EE container, which requires it to do some additional work for the simple servlet. [Though the fact that the glassfish ops/sec increases so much with more users is an indication that there's probably some bottleneck we could fix in the code at 30 users; hmm...a performance engineer's work is never done.] That result at 5000 users? I'll discuss it later, but it's an anomaly.

But first: what about 16,000 connections? In addition to producing low throughput, the tomcat run also reported:

    ERROR: Little's law failed verification: 16000 users requested; 13092.3455 users simulated.

In essence: almost 20% of the connections weren't serviced as expected (glassfish reported a similar error). I could repeat the test, and sometimes it would pass; sometimes it would fail. But I'm clearly at the limit here of the hardware and software.
In this scenario, most of the errors are timeout errors on connection: the server is too saturated in this test to accept new connections. Note that this wouldn't happen with something like ab, because ab's single-threaded nature inherently introduces an arbitrary (and unmeasured) amount of think time into the equation. The amount of think time is crucial, in that it drastically reduces the load on the server; and an arbitrary amount of think time is fatal, because we no longer know what we're measuring.

To test this scenario properly, we introduce a deterministic think time into the driver by including a -W 2000 parameter, which says each client should have a 2 second (2000 ms) think time between requests. Now for 16,000 users, each server gave me these results:

                      Glassfish    Tomcat
    ops/second        6988.9       6615.3
    Avg. resp time    0.242        0.358
    Max resp time     1.519        3.693
    90% resp time     0.6          0.75

Now both containers are handling the 16,000 users, and this time the data we get regarding throughput and response time is valid.

Back to that result at 5000 users. The other interesting output from the Faban common driver for the glassfish result was:

    ERROR: Think time deviation too high; request 0; actual is 1.0

Or in the case of tomcat, the actual was 6.0 (accounting for their better score) -- but the point is, although we didn't want think time on the client, the client had some bottleneck that didn't allow it to keep up, and hence the benchmark result suffered. In effect, we ended up benchmarking the client again, having yet again introduced an arbitrary, non-deterministic think time. So even for 5000 users, we need to use some think time to get an accurate assessment of the server behavior. And so here are the results at 5000 users with a 500 millisecond think time:

                      Glassfish    Tomcat
    ops/sec           7607.25      7224.1
    Avg. resp time    0.149        0.182
    Max resp time     0.737        2.626
    90% resp time     0.25         0.25

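Little's law, which the driver used to flag the failed 16,000-user run, also makes a handy sanity check on the published tables. For a closed system it says the number of active users N equals throughput X times the sum of response time R and think time Z. The sketch below plugs in only the averages from the tables in this entry; the function name is illustrative, and the driver's own verification presumably works from its internal measurements rather than these rounded averages, so the figures will not match its error message exactly.

```python
def simulated_users(ops_per_sec: float, avg_resp_s: float, think_s: float = 0.0) -> float:
    """Little's law for a closed system: N = X * (R + Z)."""
    return ops_per_sec * (avg_resp_s + think_s)


runs = [
    # (label, requested users, ops/sec, avg response time s, think time s)
    ("glassfish, 16000 users, no think time", 16000, 6144.3, 2.544, 0.0),
    ("tomcat, 16000 users, no think time", 16000, 5263.0, 2.375, 0.0),
    ("glassfish, 16000 users, 2 s think time", 16000, 6988.9, 0.242, 2.0),
    ("tomcat, 16000 users, 2 s think time", 16000, 6615.3, 0.358, 2.0),
    ("glassfish, 5000 users, 0.5 s think time", 5000, 7607.25, 0.149, 0.5),
    ("tomcat, 5000 users, 0.5 s think time", 5000, 7224.1, 0.182, 0.5),
]

for label, requested, x, r, z in runs:
    n = simulated_users(x, r, z)
    print(f"{label}: {n:.0f} of {requested} users ({100 * n / requested:.1f}%)")
```

On these averages, the no-think-time tomcat run accounts for only about 12,500 of the 16,000 requested users, while the runs with a deterministic think time all come out within roughly 3% of the requested count.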
So does any of this mean that glassfish is better than tomcat? For some applications, probably. For others, probably not. The real point to take away from this is how important it is to understand what you're measuring when you measure performance. The tests I've run are much too simple to draw any conclusions from: the only realistic benchmark is your own application. But hopefully, you now have a better understanding of how to approach large-scale testing of your own application.

The Common Driver for Faban is brand new code, so it hasn't yet been integrated into Faban's build schedule -- in fact, there is an issue with how it handles POST requests, which is what is delaying its integration. For now, you can download the fabancommon.jar and fabandriver.jar files I used for testing. If you find any problems with it (other than trying a POST request), be sure to let me know!