The story of the write cache and half a worm
We’ve been doing some performance testing of Derby, and we
discovered something that I suspect many of you out there may not be
aware of. I know it caught some of us by surprise, and we’re
dealing with databases all the time.
First of all, let’s talk about your data. I think most of
you agree that when you store your data in a transactional database,
you expect it will meet the transactional guarantees of atomicity,
consistency, isolation and durability. I mean, why else go through
the trouble of using a database? For example, if you commit a
transaction, and your database says “OK”, then if in the
next moment the database crashes, when it comes back up, the data
should (a) still be there and (b) be consistent (e.g. it’s
worse finding half a worm in your apple than a whole worm).
In a previous blog entry I talked about how database systems sometimes don’t
provide that guarantee. They either aren’t transactional at
all, or by default they don’t write the log record to disk as
part of the commit. They take care of it after the commit, in a sort
of lazy way that significantly improves throughput. It reminds me of
a teenager promising to clean their room “later.” Maybe
they will, maybe they won’t. They even call it a “lazy
write,” and I get this image of the log subsystem hanging out
on the bed reading comic books. The problem with this approach is
that “later” never happens if there is a crash before the
system gets around to writing the log to disk.
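What separates a lazy write from a real commit is a forced write: before the database says “OK,” the log record has to be pushed through every buffer between the application and the platter. Here is a minimal sketch in Python of what a forced log append looks like (the file name and record contents are made up for illustration):

```python
import os

def durable_append(path, record):
    """Append a log record and force it to disk before returning."""
    with open(path, "ab") as log:
        log.write(record)
        log.flush()              # user-space buffer -> OS page cache
        os.fsync(log.fileno())   # ask the OS to push the page cache to disk

durable_append("derby.log", b"COMMIT txn 42\n")
```

The catch, and the point of this post, is that fsync() only gets the data as far as the drive. If the drive’s own write cache is enabled and acknowledges the write before it hits the platter, you are right back to “later.”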
Well, it turns out that some operating systems and hard drives
play the “later” game with you as well. At this level
this game is played by enabling the write cache by default, either in
the operating system or within the disk controllers. Linux and
Windows have the write cache enabled by default, as do ATA drives and
even RAID controllers. Solaris has its write cache turned off by
default, and also will try to
turn off the write caches on any drives attached to the
system (I have heard from my contacts in Solaris that many ATA
drives don’t even let the operating system turn off the
write cache -- you set the flag to turn the cache off, but the drive
controller basically ignores you. ATA vendors actually do not even
certify their disks for recovery with the write cache turned off).
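If you want to see which game your own drives are playing, on Linux you can query and toggle the drive-level write cache with hdparm for ATA disks (sdparm exposes the same knob for SCSI). The device names below are examples, both tools need root, and, as noted above, some ATA drives will accept the setting and quietly ignore it:

```shell
# Query the current write-cache setting on an ATA drive (example device)
hdparm -W /dev/sda

# Disable the drive's write cache; -W1 turns it back on
hdparm -W0 /dev/sda

# SCSI drives expose the same setting as the WCE mode-page bit
sdparm --get=WCE /dev/sda
sdparm --clear=WCE /dev/sda   # turn the write cache off
```

Whatever you choose, check it on every machine in a benchmark, not just one.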
Minor detail: if your disk crashes or there is a power failure,
you’ve lost some of your data. Actually, with a write cache
it’s even worse than if the database were doing the caching,
because you can also lose consistency (half a worm). Your filesystem
or database can become corrupted.
At least with the database-level optimizations, where the log is
written lazily, you are guaranteed consistency, if not durability.
The vendors know they do this, but it’s not very well
published. Why quietly enable it by default (or even prevent you from
disabling it)? It would seem the right thing to do for customers
would be to have the write cache off by default and let them
turn it on if they want to. You know, opt-in instead of opt-out.
One can only guess, but my strong suspicion for the reason behind
this approach is pure and simple: marketing. When you test with the
default configuration, your write-intensive applications scream. They
blow the competition out of the water.
So, you have two choices. You can either turn off the write cache,
and suffer the performance drop in the name of consistency and
durability, or leave it on and take the risk that your database may
become corrupted.
You can mitigate the risk using tricks like backup power supplies,
but the risk is still there. Not a fun choice to make, but it would
be great if this were a conscious choice by the customer instead of
one the vendors make for you.
There is another very important point here. It is very easy to
get the wrong impression about Solaris running with SCSI drives when
comparing it with other operating systems and disk types. So please,
when running database or other write-intensive performance
comparisons, make sure you have the write cache on or off consistently.
Otherwise you’re comparing (wormy) apples to oranges.