
How to test container scalability

Posted by sdo on May 2, 2007 at 11:38 AM PDT

Recently, I've been asked a lot about Covalent Technologies' report that
Tomcat 6 can scale to 16,000 users and what that means for glassfish.
Since glassfish can easily scale to 16,000 users as well -- as Covalent
found out (http://people.apache.org/~fhanik/tomcat/glassfish-reconf.txt)
once they properly configured glassfish
(http://weblogs.java.net/blog/jfarcand/archive/2007/01/configuring_gri.html) --
my reply has usually been accompanied by a shrug: we've known for quite
some time that NIO scales well.



But what does it mean to scale to N number of users, where N is large?
The answer is highly dependent on your benchmark, and in particular to
the think time that your benchmark uses. It's very easy to scale to
16,000 users if they each make a request every 90 seconds: that's on
the order of 180 requests/second. On the other hand, if there's no
think time in the equation, then continually handling 16,000 requests
is quite difficult, particularly on small machines. Closely related to
this is the response time of your requests: handling 16,000 requests
with an average response time of 10 seconds isn't particularly helpful
to your end users. But the most difficult aspect in scaling to 16,000
users is finding sufficient client horsepower to make sure that the
clients themselves aren't the bottleneck. Otherwise, any conclusions
you draw about the throughput or performance of the server are simply
wrong: the conclusions apply to the performance of the clients. So in
this blog, I'll explore some of the considerations you need to
examine in order to benchmark a large system properly.
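
To make that arithmetic concrete, here's the back-of-the-envelope calculation
behind the "180 requests/second" figure; it's really just Little's law
(concurrent users = throughput x (response time + think time)) rearranged.
The response time below is an assumed value, and with a 90-second think time
it barely matters:

// Offered load for 16,000 users with a 90-second think time (illustrative numbers).
public class OfferedLoad {
    public static void main(String[] args) {
        int users = 16000;
        double thinkTimeSec = 90.0;   // each user makes a request every 90 seconds
        double respTimeSec = 0.1;     // assumed response time; negligible next to the think time
        double requestsPerSec = users / (thinkTimeSec + respTimeSec);
        System.out.printf("Offered load: %.1f requests/sec%n", requestsPerSec);
        // prints roughly 177.6 -- on the order of 180 requests/second
    }
}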



I've written before about why the Apache Benchmark (ab) can't handle this
situation (http://weblogs.java.net/blog/sdo/archive/2007/03/index.html);
surprisingly enough, I'd been ranting against ab long before Covalent
published their benchmark, and it's just fortuitous timing that they brought
ab's failings to light at the same time I was fed up with questions about ab
benchmarks from my colleagues. So for the tests I'll describe here, I
used Faban's new Common Driver. I've also previously written about how Faban
is a great, configurable benchmarking framework
(http://weblogs.java.net/blog/sdo/archive/2007/04/index.html), but the new common
driver is a simple, command-line program that can benchmark requests of
a single URL. I ran the tests on a partitioned SunFire T2000. This
particular machine has 24 logical CPUs (6 cores with 4 hardware
threads each, but for our purposes, simply 24 CPUs), which I partitioned
into a server set of 4 CPUs and a client set of 20 CPUs. Yes, it takes
20 CPUs to drive some of the tests I ran, and so for consistency, I
kept that configuration for all of them. But it's a crucial point: if
the client is a bottleneck, you're measuring the client performance,
not the server performance. Using a set of processors on a single
machine allowed me to run the tests bypassing the network, which also
removes a potential bottleneck from measuring the server performance.
Given that there are only 4 CPUs for the server, I configured all
containers to use 2 acceptor threads and 20 worker threads, and
otherwise followed Sun's and Covalent's blog entries on configuring the
containers.



I started with a simple test:

java -d64 -classpath $JAVA_HOME/lib/tools.jar:fabancommon.jar:fabandriver.jar \
   -Xmx3500m -Xms3500m com.sun.faban.driver.cd -c 30 http://localhost/tomcat.gif

This runs 30 separate clients (each in its own thread), each of
which continually requests tomcat.gif with no think time. You'll notice
we're using a 64-bit JVM for the test; eventually we'll be creating
16000 threads, which will require more than 4GB of address space. So to
make it easier for me, I used that JVM for all my tests. Have I
mentioned that driving a big client load requires a lot of resources so
that the client doesn't become the bottleneck?
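
As a rough sanity check on that address-space claim, consider the heap plus
the per-thread stacks (the stack size below is my assumption, not a measured
value; the conclusion holds for any plausible default):

// Rough address-space estimate for a client JVM driving 16,000 threads (illustrative).
public class AddressSpaceEstimate {
    public static void main(String[] args) {
        long heapBytes = 3500L * 1024 * 1024;       // -Xmx3500m
        long threads = 16000;
        long stackBytesPerThread = 256L * 1024;     // assume ~256 KB of stack per thread
        double totalGb = (heapBytes + threads * stackBytesPerThread) / (1024.0 * 1024 * 1024);
        System.out.printf("Roughly %.1f GB of address space%n", totalGb);
        // ~7.3 GB -- well past what a 32-bit process can address, hence -d64
    }
}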

The common driver reports three pieces of information: the number of
requests served per second, the average response time per request, and
the 90th percentile for requests: 90% of requests were served with that
particular response time or less. It will also report the number of
errors observed and some error conditions I'll discuss a little later.
I varied this test for different numbers of clients to see these
results:

# Users       Glassfish          Tomcat
     30     7552.9/0.004    7614.6/0.003
    100    10004.6/0.009    7680.4/0.013
   1000    12434.7/0.079    6880.3/0.145
   5000     8942.7/0.534    7589.0/0.654

The results here are operations per second and the average response
time. I assume I've misconfigured Tomcat's file cache here, but
the point isn't to make a comparison of the products' absolute
performance; rather, it is to explore issues around scalable
benchmarking. For static content, we get decent scaling, though at some
point there are enough requests that the throughput of the server
suffers: just what we would expect. So what about a dynamic test? The
numbers below come from surfing to http://localhost/Ping/PingServlet, which
is just a simple servlet that prints out 4 HTML strings and returns.
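
(The PingServlet source isn't included in this post; a servlet along these
lines would behave the same way, though the class name, package, and exact
output here are my guesses.)

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// A trivial servlet: write a few lines of HTML and return, so the test
// exercises the request path without doing any real application work.
public class PingServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><head><title>Ping</title></head>");
        out.println("<body><h1>Ping</h1>");
        out.println("<p>Hello from PingServlet</p>");
        out.println("</body></html>");
    }
}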

# Users       Glassfish          Tomcat
     30     5033.3/0.005    7154.0/0.004
    100     6359.5/0.015    7459.5/0.013
   1000     7411.2/0.134    6483.2/0.154
   5000     6060.1/0.818    6976.5/0.712
  16000     6144.3/2.544    5263.0/2.375

Here the numbers are fairly close. At the low end, glassfish pays a
penalty for being a full Java EE container, which requires it to do
some additional work for the simple servlet. [Though the fact that the
glassfish ops/sec increases so much with more users is an indication
that there's probably some bottleneck we could fix in the code at 30
users; hmm...a performance engineer's work is never done.] That result
at 5000 users? I'll discuss it later, but it's an anomaly. But first:
what about 16,000 connections? In addition to producing low throughput,
the tomcat run also reported:

ERROR: Little's law failed verification: 16000 users requested; 13092.3455
users simulated.

In essence: almost 20% of the connections weren't serviced as expected
(glassfish reported a similar error).
I could repeat the test, and sometimes it would pass; sometimes it
would fail. But I'm clearly at the limit here of the hardware and
software. In this scenario, most of the errors are timeout errors on
connection: the server is too saturated in this test to accept new
connections. Note that that wouldn't happen with something like ab,
because ab's single-threaded nature inherently introduces an arbitrary
(and unmeasured) amount of think time into the equation. The amount of
think time is crucial, in that it drastically reduces the load on the
server; and an arbitrary amount of think time is fatal, because we no longer know
what we're measuring.
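
For the record, that check is the driver verifying Little's law: the number
of users should be roughly the throughput times the sum of response time and
think time. Plugging in the Tomcat numbers from the 16,000-user row above
gives a figure in the same ballpark as the 13,092 users the driver says it
simulated (this is my rough reconstruction, not the driver's exact bookkeeping):

// Rough reconstruction of the driver's Little's law check (illustrative only).
public class LittlesLawCheck {
    public static void main(String[] args) {
        double opsPerSec = 5263.0;    // Tomcat throughput at 16,000 users, from the table above
        double avgRespTimeSec = 2.375;
        double thinkTimeSec = 0.0;    // no think time in this run
        double usersSimulated = opsPerSec * (avgRespTimeSec + thinkTimeSec);
        System.out.printf("Users effectively simulated: %.0f of 16000%n", usersSimulated);
        // ~12,500 -- well short of 16,000, hence the error above
    }
}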



To test this scenario properly, we introduce a deterministic think time
into the driver by including a -W 2000 parameter, which says each
client should have a 2 second (2000 ms) think time between requests.
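
The invocation is the same as before, with the think-time flag added and the
client count raised; this is a sketch using this test's values (I'm assuming
the servlet URL here, since the post doesn't spell out which URL the
think-time run hit):

java -d64 -classpath $JAVA_HOME/lib/tools.jar:fabancommon.jar:fabandriver.jar \
   -Xmx3500m -Xms3500m com.sun.faban.driver.cd -c 16000 -W 2000 http://localhost/Ping/PingServlet

Now for 16,000 users, each server gave me these results: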

                    Glassfish       Tomcat
ops/second           6988.9          6615.3
Avg. resp time        0.242           0.358
Max resp time         1.519           3.693
90% resp time           0.6            0.75

Now both containers are handling the 16000 users, but the data we get
regarding throughput and response time is valid.



Back to that result at 5000 users. The other interesting output from
the Faban common driver for the glassfish result was:

ERROR: Think time deviation too high; request 0; actual is 1.0

Or in the case of tomcat, the actual was 6.0 (accounting for their
better score) -- but the point is, although we didn't want think time
on the client, the client had some bottleneck that didn't allow it to keep up and
hence the benchmark result suffered. In effect, we ended up
benchmarking the client again, having yet again introduced an arbitrary,
non-deterministic think time. So even for 5000 users, we need to use
some think time to get an accurate assessment of the server behavior.
And so here are the results at 5000 users with a 500 millisecond think
time:

                     Glassfish       Tomcat
ops/sec               7607.25         7224.1
Avg. resp time          0.149          0.182
Max resp time           0.737          2.626
90% resp time            0.25           0.25

So does any of this mean that glassfish is better than tomcat? For
some applications, probably. For others, probably not. The real point
to take away from this is an understanding of how important it is to
understand what you're measuring when you measure performance. The
tests I've run are much too simple to draw any conclusions from: the
only realistic benchmark is your own application. But hopefully, now
you have a better understanding of how to approach large-scale testing
of your own application.



The Common Driver for Faban is brand new code, so it hasn't yet been integrated into Faban's build schedule -- in fact, there is an issue with how it handles POST requests, which is what is delaying its integration. For now, you can download the fabancommon.jar and fabandriver.jar files I used for testing. If you find any problems with it (other than trying a POST request), be sure to let me know!
