
FESI Studying NOSQL (Cassandra, Hadoop, and Voldemort)

Posted by fossesi on May 5, 2010 at 1:51 PM PDT

Last week's kickoff of FESI's research program went very well. There are a number of folks (>500) who are now following this blog on Java.net, and a number who have gotten involved.  While we wait for more folks, we'll be researching new technologies, the first being NOSQL databases. 

If the kind of topics we're researching interests you, please feel free to join the project; we need to reach a critical mass of developers before we move on to our next phase of research. While we wait, we will take a look at current open-source projects, identifying which ones are moving from bleeding-edge to more widely accepted. This will help us choose which technologies to research in depth. We'll start with NOSQL databases, but if anyone has a suggestion for another technology, we'll research that one next.

NOSQL databases seem to fit that description of moving from bleeding-edge to more accepted. A NOSQL database is a database which is designed to be "non-relational, distributed, open-source and horizontal scalable. The original intention has been modern web-scale databases." (definition blatantly stolen from nosql-database.org) For the next week I'll be looking at three strong representatives in this category: Hadoop/HBase (they've got some great merchandise in their online store; I recently got a Hadoop sweatshirt), Cassandra (what appears to be the strongest contender), and Voldemort (an open-source implementation of the Amazon Dynamo key-value store). I chose these because they all have active, successful commercial implementations, are still fairly unknown, and I liked the names (yes, not the best criteria, but hey, it's my blog, LOL).

To start, I'll install each of these and attempt a simple read and write.  Later this week, I'll share with you my experiences with Cassandra, next week Hadoop/HBase,  and the week after that Voldemort. While you wait for my next post, feel free to join the FossESI project, respond to this post, join our Facebook Fanpage (FossESI), follow us on Twitter, or play outside in the great summer weather.  Your choice. :-) 

Once we have a core of developers, we'll begin Phase 2 of our program: implementing a Struts/Servlet-based set of simple applications on an open-source JEE container with a simple MySQL backend database. After this is done, we'll begin the third phase, whose goal is to replace all of those technologies with the bleeding-edge technologies available in the open-source community. Of course, as an open-source project, we'll make our integration code available via Java.net under the GPL 3.0 license.

Comments

Cassandra Installation Notes Posted

Just posted notes for installing Cassandra. Most of the Linux/Unix notes are already available, but we went the extra mile to explain how to install Cassandra on Windows Vista. Try it out! Sign up for the FossESI project to get the full text of the notes.

Please let me know when

Please let me know when we are starting with this.

It is in progress

Hi,

I've downloaded and compiled Cassandra on my Windows Vista machine, and now I'm attempting to get it running. It's been pretty easy thus far. Feel free to follow along; I'm sure we could trade secrets on its setup and usage! Join the FossESI project to discuss further!

add MongoDB to that list

A group at the enterprise where I work did a similar study and proof-of-concept demo half a year ago. Although we are a Java-only shop with regard to business logic (applications), they concluded that MongoDB had the most useful feature set for us. The usage scenarios where the other systems excel over MongoDB are just too extreme for most enterprises (at least for ours). I think Tokyo Tyrant was actually the runner-up, for more cache-like usage.

Great idea, Thanks!

Consider it added! Most of my customers have the 1TB+ issue, which makes managing large, geographically dispersed server farms, all containing nodes of the database, a major focus. Part of my study would be how easy it is to add a new node to the existing nodes. With that in mind, maybe what I'll do is hold two separate database studies, one for massive databases and the other for smaller databases (around 100GB with single or dual nodes). What do you think?