
Database ACIDity

Posted by davidvc on August 9, 2005 at 11:37 AM PDT

A thread on Server Side talks about the relative performance of Derby compared to other databases, based on a series of test runs at polepos. One person evaluating those results uncovered two very common mistakes that we have encountered when people run performance comparisons with Derby.

First of all, a number of the databases Derby is compared with do not, by default, synchronize to disk on transaction commit.
This “feature” significantly improves throughput, but what testers may not realize is that if the database crashes during the test, it is quite possible that some of the data will be lost (I speak from experience on this one). That may be fine depending upon your durability requirements, but when running a performance test it's really important to get this right and make sure everyone is on a level playing field. What you should be looking for is that all the databases are configured to guarantee the same levels of ACID support (Atomicity, Consistency, Isolation and Durability). For example, see Dan Debrunner's blog on how HSQL is pretty loose in terms of ACID support; this is actually fine in certain environments, but it can result in some pretty misleading conclusions about relative performance.
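
To make this concrete, here is a minimal sketch of what putting two databases on an equal durability footing might look like. I am recalling Derby's derby.system.durability property and HSQLDB's SET WRITE_DELAY statement from their docs, and the "benchdb" database name is made up, so verify the details against the current manuals before relying on this:

    // Sketch: pinning down durability before a benchmark run. Assumes the
    // Derby embedded driver and HSQLDB are on the classpath; "benchdb" is
    // a hypothetical database name.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DurabilitySetup {
        public static void main(String[] args) throws Exception {
            // Derby syncs on commit by default; derby.system.durability=test
            // is the setting that *relaxes* it, so make sure it is not set.
            System.clearProperty("derby.system.durability");
            Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
            Connection derby =
                DriverManager.getConnection("jdbc:derby:benchdb;create=true");

            // HSQLDB delays writes by default; force a sync on each commit
            // so it plays by the same rules as Derby.
            Class.forName("org.hsqldb.jdbcDriver");
            Connection hsql =
                DriverManager.getConnection("jdbc:hsqldb:file:benchdb", "sa", "");
            Statement stmt = hsql.createStatement();
            stmt.execute("SET WRITE_DELAY FALSE");
            stmt.close();

            // ... run the same workload against both connections ...

            derby.close();
            hsql.close();
        }
    }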


The other thing I see over and over
again is that JDBC statements are not prepared. You can create a
JDBC statement in two ways: Connection.createStatement()
and Connection.prepareStatement().
When you prepare a statement using prepareStatement(),
you are telling the database that you intend to use the same
statement multiple times. Most databases will use this hint to
compile and cache the statement, saving significant overhead when you
use it again. This is particularly important for Derby, because Derby compiles statements into Java bytecode. This involves a heavier up-front cost than in other databases, but the result is that our statements are run directly by the Java bytecode interpreter. Depending on the VM you are using, the JIT compiler can then compile this down to native code, giving you significant performance improvements.
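
To show the difference, here is a minimal sketch of the two patterns (the customers table and its id and name columns are made up for illustration):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StatementPatterns {
        // The pattern I keep seeing in benchmarks: the SQL text is
        // recompiled by the database on every single execution.
        static void unprepared(Connection conn, int[] ids) throws Exception {
            Statement stmt = conn.createStatement();
            for (int id : ids) {
                ResultSet rs = stmt.executeQuery(
                        "SELECT name FROM customers WHERE id = " + id);
                rs.close();
            }
            stmt.close();
        }

        // The right pattern for repeated statements: compile once, then
        // just bind new parameter values and re-execute.
        static void prepared(Connection conn, int[] ids) throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT name FROM customers WHERE id = ?");
            for (int id : ids) {
                ps.setInt(1, id);
                ResultSet rs = ps.executeQuery();
                rs.close();
            }
            ps.close();
        }
    }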


All of this, to me, underscores the dangers of comparing databases based solely on their performance. Don't get me wrong, performance and throughput are very important, and any evaluation of a database should involve performance comparisons. But the problem is that it's so easy to put up a blog or web page comparing database performance, and then someone concludes “Database X is slow” or “Database Y is super-fast.” That creates impressions that are hard to break, and a database that might be perfect for you may never even make it onto your radar.


What I always tell people is that,
rather than depending on someone else to do your performance tests
for you, you really need to take the time to do it yourself based on
the characteristics of your specific application. What is the
percentage of reads to writes? How many concurrent threads? What are
your isolation requirements? What are your requirements for
durability? Are there lots of short transactions or are the
transactions long-lived? Are you running embedded or client/server?
Are your queries mostly single-table queries based on primary and
secondary keys, or do you have a lot of complex joins? And so on.
And then, when you run your tests, make sure that the databases are
configured to be as equal as possible. Of course, one of the things you want to evaluate is how much work it takes to tune the database (or whether you can even figure out how), but ultimately your performance tests should, as much as possible, put the databases on equal footing.


It would be nice if there were a silver
bullet benchmark that you could just crank all the databases through,
but that's just not the way it is. One nice thing about Derby and
many of the other open-source databases is you can download them and
get going right away, and they're free, so you can do these kinds of
evaluations without spending $$$ and days reading through the install
guide and having to hire a DBA to tune it.


While I'm at it, another thing you
should be aware of when evaluating Derby is that Derby does not
automatically close result sets after all rows have been read. The
JDBC 3 spec does not clearly specify the behavior to be expected
here; some database vendors close the result set, others don't. The
benefit of not closing the result set is that you can still get
metadata about the result set even after having read all the data.
The problem is that if you are using the same connection over and over again in a long-running test or application, those unclosed result sets accumulate, and at some point you will run out of memory. Note that in a web server/app server environment, if you are using a DataSource to get your connections, the connections are pooled, so even if you close them explicitly in your code they are not actually closed but simply put back in the pool. So if you want to avoid these problems, and if you want to use JDBC in a truly portable way, make sure you close your result sets when you are done with them.
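
In practice that means pairing every executeQuery() with an explicit close in a finally block. A minimal sketch (again using a made-up customers table):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class CloseYourResultSets {
        static String lookupName(Connection conn, int id) throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT name FROM customers WHERE id = ?");
            try {
                ps.setInt(1, id);
                ResultSet rs = ps.executeQuery();
                try {
                    return rs.next() ? rs.getString(1) : null;
                } finally {
                    rs.close(); // don't count on the driver closing it for you
                }
            } finally {
                ps.close(); // also releases any result set it still holds
            }
        }
    }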