|
|
||
Tom White's BlogDistributed Archives"Disks have become tapes"Posted by tomwhite on March 18, 2008 at 06:07 AM | Permalink | Comments (3)MapReduce is a programming model for processing vast amounts of data. One of the reasons that it works so well is because it exploits a sweet spot of modern disk drive technology trends. In essence MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the transfer rate of the disk. Contrast this to accessing data from a relational database that operates at the seek rate of the disk (seeking is the process of moving the disk's head to a particular place on the disk to read or write data). So why is this interesting? Well, look at the trends in seek time and transfer rate. Seek time has grown at about 5% a year, whereas transfer rate at about 20% [1]. Seek time is growing more slowly than transfer rate - so it pays to use a model that operates at the transfer rate. Which is what MapReduce does. I first saw this observation in Doug Cutting's talk, with Eric Baldeschwieler, at OSCON last year, where he worked through the numbers for updating a 1 terabyte database using the two paradigms B-Tree (seek-limited) and Sort/Merge (transfer-limited). (See the slides and video for more detail.) The general point was well summed up by Jim Gray in an interview in ACM Queue from 2003: ... programmers have to start thinking of the disk as a sequential device rather than a random access device.Or the more pithy: "Disks have become tapes." (Quoted by David DeWitt.) But even the growth of transfer rate is dwarfed by another measure of disk drives - capacity, which is growing at about 50% a year. David DeWitt argues that since the effective transfer rate of drives is falling we need database systems that work with this trend - such as column-store databases and wider use of compression (since this effectively increases the transfer rate of a disk). Of existing databases he says: Already we see transaction processing systems running on farms of mostly empty disk drives to obtain enough seeks/second to satisfy their transaction processing rates.But this applies to transfer rate too (or if it doesn't yet, it will). Replace "seeks" with "transfers" and "transaction processing" with "MapReduce" and I think over time we'll start seeing Hadoop installations that choose to use large numbers of smaller capacity disks to maximize their processing rates. [1] See Trends in Disk Technology by Michael D. Dahlin for changes between 1987-1994. For the period since then these figures still hold - as it's relatively easy to check using manufacturer's data sheets, although with seek time it's harder to tell since the definitions seem to change from year to year and from manufacturer to manufacturer. Still, 5% is generous. (Cross-posted at my other blog.) Consistent HashingPosted by tomwhite on November 27, 2007 at 09:56 AM | Permalink | Comments (2)I've bumped into consistent hashing a couple of times lately. The paper that introduced the idea (Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web by David Karger et al) appeared ten years ago, although recently it seems the idea has quietly been finding its way into more and more services, from Amazon's Dynamo to memcached (courtesy of Last.fm). So what is consistent hashing and why should you care? The need for consistent hashing arose from limitations experienced while running collections of caching machines - web caches, for example. If you have a collection of n cache machines then a common way of load balancing across them is to put object o in cache machine number hash(o) mod n. This works well until you add or remove cache machines (for whatever reason), for then n changes and every object is hashed to a new location. This can be catastrophic since the originating content servers are swamped with requests from the cache machines. It's as if the cache suddenly disappeared. Which it has, in a sense. (This is why you should care - consistent hashing is needed to avoid swamping your servers!) It would be nice if, when a cache machine was added, it took its fair share of objects from all the other cache machines. Equally, when a cache machine was removed, it would be nice if its objects were shared between the remaining machines. This is exactly what consistent hashing does - consistently maps objects to the same cache machine, as far as is possible, at least. The basic idea behind the consistent hashing algorithm is to hash both objects and caches using the same hash function. The reason to do this is to map the cache to an interval, which will contain a number of object hashes. If the cache is removed then its interval is taken over by a cache with an adjacent interval. All the other caches remain unchanged. Hadoop + EC2 + S3Posted by tomwhite on July 20, 2007 at 01:10 AM | Permalink | Comments (7)I've raved about the MapReduce parallel programming model in the past, and Apache Hadoop (the framework for running MapReduce applications), and Amazon's compute and storage webservices (EC2 and S3). Now I've written an article - Running Hadoop MapReduce on Amazon EC2 and Amazon S3 - about using them all together to do some data crunching. The nice thing is that you can fire up a fair sized Hadoop cluster (20 nodes is the current limit on EC2) in minutes and run it just for as long as you need to run your job - you pay by the hour. EC2 is still in limited beta and has had long waiting lists to get on it, but recently they cleared the backlog, so if you're interested in trying it out, now might be a good time. Affordable Web-Scale Computing ReduxPosted by tomwhite on August 24, 2006 at 02:01 PM | Permalink | Comments (1)In March I wrote of affordable web-scale computing:
Well, now it's possible with the beta launch of Amazon EC2 (picked up from the O'Reilly Radar). EC2 (apart from coincidentally being the postcode of my company's new London office) stands for Elastic Compute Cloud and allows you to commission compute resources on an on-demand basis using simple web-service-based tools. The unit of compute capacity is an Amazon Machine Image (AMI) - a Linux image - which you can configure to have any software you like on it. You can run any number of instances, paying for the number of instance hours you use (and the data you transfer). This goes beyond what I wished for in March as it allows you to run anything on the image! Going back to Hadoop and MapReduce, I can imagine a generic Hadoop AMI that you configure your job on, before commissioning a number of EC2 server instances to run it. Press go, wait for your job to complete and then decommission the server instances. Definitely one to watch. S3MapPosted by tomwhite on August 13, 2006 at 12:30 PM | Permalink | Comments (1)In case you haven't heard of it, Amazon S3 is a web service for storing data. The two great things about it are that it's simple (look at its nice REST API), and it's cheap (with a pay-as-you-go charging model). This latter point explains the growing number of startups that are using it to launch new business ventures: no data silos to maintain, and pay by the gigabyte. My favourite innovative service to use Amazon S3 uses AJAX to great effect to implement a wiki that stores its content on S3. Read all about it. If you think about it, this is an interactive service that resides entirely on S3, so it will scale and scale - there's no need for an application server. I think Content Management Systems could take this approach too.
It struck me that you could treat S3 as a big hashtable, so I tried writing an implementation of
This code prints:
Notice that keys are strings, but the objects in the map can be of any type, as long as they are Also, the map needs the connection parameters for an S3 account, and a bucket name. One bucket corresponds to one map. The map is persistent, so I can run the code above again, but without putting anything into the map, and the values put in previously will be retrieved and printed out exactly as before.
Writing an implementation of So, what is it useful for? Well, I haven't actually had a use for it yet, but S3Map is really a non-transactional persistent datastore, so it could be used in many scenarios, from applets (if the applet is hosted on S3 too), to desktop apps (to share user data between machines?), to server-side apps. If you're intrigued, you can try it out yourself by signing up for an Amazon S3 account, and downloading S3Map (it's hosted on S3 of course). Affordable Web-Scale ComputingPosted by tomwhite on March 17, 2006 at 06:10 AM | Permalink | Comments (0)With the launch of Amazon S3 (Simple Storage Service) we are seeing a continuation of the trend for the big web companies to monetize their computing infrastructure by opening it up to developers. It is probably only a matter of time before we see Google create something similar, which would essentially be a limited public interface onto the Google File System. I would love an API that exposes Google's MapReduce, a simple programming model for crunching on large datasets. You can write and run MapReduce programs today, using Hadoop, but it's only really useful if you have enough machines at your disposal. The pay-as-you-go model of S3 (and Sun Grid) would be very attractive to developers who want to run ad hoc computations, or can't afford the upfront investment in hardware. Hadoop!Posted by tomwhite on February 08, 2006 at 01:49 AM | Permalink | Comments (4)In a previous blog I wrote about Nutch's MapReduce implementation, for distributed processing of massive data sets. This, and the closely related Nutch Distributed File System (renamed Hadoop Distributed File System), have now been moved into a standalone project called Hadoop. According to Doug Cutting, who created Hadoop (as well as Lucene and Nutch), the name comes from: The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term. An interesting way to name projects - how do you choose names? MapReducePosted by tomwhite on September 25, 2005 at 10:36 PM | Permalink | Comments (3)Doug Cutting has done it again. The creator of Lucene and Nutch has implemented (with Mike Cafarella and others) a distributed platform for high volume data processing called MapReduce. MapReduce is the brainchild of Google and is very well documented by Jeffrey Dean and Sanjay Ghemawat in their paper MapReduce: Simplified Data Processing on Large Clusters. In essence, it allows massive data sets to be processed in a distributed fashion by breaking the processing into many small computations of two types: a map operation that transforms the input into an intermediate representation, and a reduce function that recombines the intermediate representation into the final output. This processing model is ideal for the operations a search engine indexer like Nutch or Google needs to perform - like computing inlinks for URLs, or building inverted indexes - and it will transform Nutch into a scalable, distributed search engine. Nutch MapReduce takes advantage of the Nutch Distributed File System (NDFS) - itself inspired by another Google Labs project, the Google File System. NDFS provides a fault-tolerant environment for working with very large files using cheap commodity hardware. Currently MapReduce is a part of Nutch, but it has been proposed that it and NDFS be moved into a separate project. However, it is perfectly possible to use the MapReduce functionality in Nutch for your own data processing. In this blog, I'll briefly describe how to get started. | ||
|
|