Skip to main content

Database ACIDity

Posted by davidvc on August 9, 2005 at 2:37 PM EDT

A thread on Server Side talks about the relative performance of Derby compared to other databases, based on a series of tests runs at polepos. One person evaluating these results uncovered two very common mistakes that we have encountered when people are running performance comparisons with Derby. First of all, a number of databases Derby is compared with by default do not synchronize to disk on transaction commit. This “feature” significantly improves throughput, but what testers may not realize is that if you were to crash the database during your test, it is quite possible that some of your data will be lost (I speak from experience on this one). Now, this may be fine depending upon your durability requirements, but when running a performance test it's really important to get this right and make sure everyone's on a level playing field. What you should be looking for is that all the databases are configured to guarantee the same levels of ACID support (Atomicity, Consistency, Isolation and Durability). For example, see Dan Debrunner's blog on how HSQL is pretty loose in terms of ACID support; this is actually fine in certain environments, but it can result in some pretty misleading conclusions about relative performance.

The other thing I see over and over again is that JDBC statements are not prepared. You can create a JDBC statement in two ways: Connection.createStatement() and Connection.prepareStatement(). When you prepare a statement using prepareStatement(), you are telling the database that you intend to use the same statement multiple times. Most databases will use this hint to compile and cache the statement, saving significant overhead when you use it again. This is particularly important for Derby, because Derby compiles statements into Java bytecode. This involves a heavier up-front cost than other databases, but the result is that our statements are run directly by the Java bytecode interpreter. Depending on the VM you are using, the JIT compiler can compile this down directly to native code giving you significant performance improvements.

All of this to me underscores the dangers of comparing databases based solely on their performance. Don't get me wrong, performance and throughput is very important, and any evaluation of a database should involve performance comparisons. But the problem is that it's so easy to put up some blog or web page comparing database performance, and then someone then concludes “Database X is slow” or “Database Y is super-fast.” It creates impressions that are hard to break, and a database that might be perfect for you may not even make it on your radar.

What I always tell people is that, rather than depending on someone else to do your performance tests for you, you really need to take the time to do it yourself based on the characteristics of your specific application. What is the percentage of reads to writes? How many concurrent threads? What are your isolation requirements? What are your requirements for durability? Are there lots of short transactions or are the transactions long-lived? Are you running embedded or client/server? Are your queries mostly single-table queries based on primary and secondary keys, or do you have a lot of complex joins? And so on. And then, when you run your tests, make sure that the databases are configured to be as equal as possible. Of course one of the things you want to evaluate is how much work it takes to tune the database, or if you can even figure out how to do it, but ultimately your performance tests should, as much as possible, put the databases on equal footing.

It would be nice if there were a silver bullet benchmark that you could just crank all the databases through, but that's just not the way it is. One nice thing about Derby and many of the other open-source databases is you can download them and get going right away, and they're free, so you can do these kinds of evaluations without spending $$$ and days reading through the install guide and having to hire a DBA to tune it.

While I'm at it, another thing you should be aware of when evaluating Derby is that Derby does not automatically close result sets after all rows have been read. The JDBC 3 spec does not clearly specify the behavior to be expected here; some database vendors close the result set, others don't. The benefit of not closing the result set is that you can still get metadata about the result set even after having read all the data. The problem is that if you are using the same connection over and over again in a long-running test or application, at some point you will run out of memory. Note that in a web server/app server environment, if you are using a DataSource to get your connections, these connections are pooled, so even if you close them explicitly in your code they are not actually closed but simply put back on the pool. So if you want avoid these problems, and if you want to use JDBC in a truly portable way, make sure you close your result sets when you are done with them.