The Source for Java Technology Collaboration
User: Password:



Tom White's Blog

September 2005 Archives


MapReduce

Posted by tomwhite on September 25, 2005 at 10:36 PM | Permalink | Comments (3)

Doug Cutting has done it again. The creator of Lucene and Nutch has implemented (with Mike Cafarella and others) a distributed platform for high volume data processing called MapReduce.

MapReduce is the brainchild of Google and is very well documented by Jeffrey Dean and Sanjay Ghemawat in their paper MapReduce: Simplified Data Processing on Large Clusters. In essence, it allows massive data sets to be processed in a distributed fashion by breaking the processing into many small computations of two types: a map operation that transforms the input into an intermediate representation, and a reduce function that recombines the intermediate representation into the final output. This processing model is ideal for the operations a search engine indexer like Nutch or Google needs to perform - like computing inlinks for URLs, or building inverted indexes - and it will transform Nutch into a scalable, distributed search engine.

Nutch MapReduce takes advantage of the Nutch Distributed File System (NDFS) - itself inspired by another Google Labs project, the Google File System. NDFS provides a fault-tolerant environment for working with very large files using cheap commodity hardware.

Currently MapReduce is a part of Nutch, but it has been proposed that it and NDFS be moved into a separate project. However, it is perfectly possible to use the MapReduce functionality in Nutch for your own data processing. In this blog, I'll briefly describe how to get started.

Continue Reading...





Powered by
Movable Type 3.01D
 Feed java.net RSS Feeds