The Source for Java Technology Collaboration
User: Password:



Tom White

Tom White's Blog

Hadoop + EC2 + S3

Posted by tomwhite on July 20, 2007 at 01:10 AM | Comments (7)

I've raved about the MapReduce parallel programming model in the past, and Apache Hadoop (the framework for running MapReduce applications), and Amazon's compute and storage webservices (EC2 and S3). Now I've written an article - Running Hadoop MapReduce on Amazon EC2 and Amazon S3 - about using them all together to do some data crunching.

The nice thing is that you can fire up a fair sized Hadoop cluster (20 nodes is the current limit on EC2) in minutes and run it just for as long as you need to run your job - you pay by the hour. EC2 is still in limited beta and has had long waiting lists to get on it, but recently they cleared the backlog, so if you're interested in trying it out, now might be a good time.


Bookmark blog post: del.icio.us del.icio.us Digg Digg DZone DZone Furl Furl Reddit Reddit
Comments
Comments are listed in date ascending order (oldest first) | Post Comment

  • But back to java...does the hadoop ami have Java? I've attempted to install tomcat on the base but "yum" keeps crapping out after a few minutes...and that would be using the gcc java. I've also tried using a Tomcat rPath ami, but the php ssh login doesn't work.

    Seems really lonely out there when it comes to tomcat and ec2. Maybe this is yet another indicator that I need to just use Ruby.

    Taylor

    Posted by: tcowan on August 10, 2007 at 09:04 PM

  • Ok, just got one running, Hadoop has jdk 1.6.

    Taylor

    Posted by: tcowan on August 10, 2007 at 09:12 PM

  • One last post, for those of you who want to run tomcat on EC2.

    get latest hadoop ami.
    login, curl the tomcat tar ball.
    curl http://... > tomcat.tar.gz

    Posted by: tcowan on August 10, 2007 at 09:20 PM

  • I guess I'm way too late on this, but it would be nice if Hadoop MapReduce used generics: Mapper, Reducer...

    Posted by: tpeierls on September 16, 2007 at 12:17 PM

  • [oops, trying again] I guess I'm way too late on this, but it would be nice if Hadoop MapReduce used generics: Mapper<U,V>, Reducer<U>...

    Posted by: tpeierls on September 16, 2007 at 12:18 PM


  • Hi Tim,


    Hadoop MapReduce will have generics from version 0.15.0 which is due out early October. If you can't wait till then then use a nightly build from http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/

    If you want the gory implementation details and discussion you can find them in this Jira issue.

    Cheers, Tom

    Posted by: tomwhite on September 16, 2007 at 01:03 PM

  • Hi Tom,
    I am trying to implement a system for using hadoop/ec2/s3 as you wrote about in order to do some serious log crunching with persistent storage. The tutorial you wrote was very helpful in getting me headed towards the right direction. Unfortunately, I am currently trying to use S3 as my file system as per the instructions at http://wiki.apache.org/hadoop/AmazonS3 and I am running into a couple of problems. I was wondering if you could offer any insights:
    1. I modified the hadoop-ec2 startup scripts so that it uses s3 (along with the key, secret key, bucket) as the file system instead of specifying the master for fs.default.name. While everything seems to work fine, there is an error while starting up the cluster on ec2:

    localhost: starting secondarynamenode, logging to /mnt/hadoop/logs/hadoop-root-secondarynamenode-ip-10-251-70-194.out localhost: Exception in thread "main" java.lang.IllegalArgumentException: port out of range:-1 localhost: at java.net.InetSocketAddress.(InetSocketAddress.java:118) localhost: at org.apache.hadoop.dfs.DataNode.createSocketAddr(DataNode.java:104) localhost: at org.apache.hadoop.dfs.SecondaryNameNode.(SecondaryNameNode.java:94) localhost: at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:481)

    this error is reproduced when I do the same on a single-node cluster on my local machine. Is this to be expected?
    2. I am also trying to be able to access the hdfs-on-s3 file system from a hadoop client that is not located on the original cluster. Is this possible? I tried just specifying S3 as the fs on my local machine but it seems to have its own view of the bucket (it can only see the files that it puts in itself, not the ones already put in there by the ec2 hadoop cluster). I thought this would be possible since I can terminate the ec2 cluster, spin it back up again, and see the hdfs-on-s3 fs as it was when I terminated, from any of the cluster machines. The reason I ask this is because I may want to migrate my hdfs-on-s3 fs away from ec2, so I want to ensure that I can plug the fs into a non-ec2 hadoop cluster if need be. I would think that there must be some extra steps that I am missing currently for properly plugging into the file system. How do I ensure that a hadoop machine/cluster gets the appropriate "view" of the S3 bucket that I want to see?
    Thanks!
    Clarence

    Posted by: perihelion on February 25, 2008 at 11:14 AM



Only logged in users may post comments. Login Here.


Powered by
Movable Type 3.01D
 Feed java.net RSS Feeds