The Java.net JavaOne 2012 Conversations: Konstantin Shvachko
I introduced myself to Konstantin Shvachko after hearing him speak in the Duke's Choice Awards BOF session at JavaOne 2012 (Hadoop was a 2012 award winner). Konstantin happened to be sitting right next to me in the audience, so before we exited the session, I asked him if we could arrange a chat before JavaOne ended.
You've probably heard of Hadoop, but many people likely don't know much about it. In the Duke's Choice BOF, Hadoop was described as enabling distributed processing of big data sets across clusters of computers. The project began in 2004 as part of a web crawling project, and strategies from a Google paper were adapted to Java in the development of Hadoop.
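To make "distributed processing of big data sets" a bit more concrete, here is a minimal sketch in plain Java of the map-and-reduce style of processing that Hadoop is widely known for. It does not use Hadoop's actual API: the class and method names (`WordCountSketch`, `mapChunk`, `reduce`, `countWords`) are hypothetical, and a parallel stream on a single machine merely stands in for a cluster of computers.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: a single-JVM stand-in for cluster-style processing.
// None of these names come from Hadoop itself.
public class WordCountSketch {

    // "Map" step: each chunk of text is counted independently of the others,
    // which is what would allow the chunks to live on different machines.
    static Map<String, Integer> mapChunk(String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : chunk.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce" step: the partial counts are merged into one overall result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // Runs the map step over all chunks (in parallel threads here,
    // as an assumption standing in for cluster nodes), then reduces.
    static Map<String, Integer> countWords(List<String> chunks) {
        List<Map<String, Integer>> partials = chunks.parallelStream()
                .map(WordCountSketch::mapChunk)
                .collect(Collectors.toList());
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList(
                "Hadoop processes big data",
                "big data needs big clusters");
        System.out.println(countWords(chunks));
    }
}
```

The point of the sketch is only the shape of the work: because each chunk is processed on its own, adding more machines (here, more threads) lets more data be handled at once, and only the small partial results need to be brought together at the end.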
In the BOF, Konstantin noted that Hadoop has been used for some unusual (apparently unanticipated) use cases. For example, dating sites use Hadoop a lot. Major oil companies also use Hadoop to process data gathered by boats that drag sounding sensors to measure and map the bottom of the ocean. Hadoop is also used by high-energy physics researchers who work with colliders in their studies.
Konstantin concluded, in the BOF, that an operating system is a good analogy for what Hadoop is, and what it does.
The next day, when we met in the JavaOne Hilton lobby, I asked Konstantin to identify the largest operational Hadoop deployments. I was very surprised by his answer. The largest Hadoop deployments in terms of data volume include:
- Yahoo (a 20-Petabyte cluster)
- eBay (20-25 Petabyte cluster)
- Facebook (about 100 Petabytes)
At that point, my thought was: what? Hadoop is a critical infrastructural component that facilitates some of the biggest global sites? Yet, so few people have even heard of Hadoop?
Konstantin supplemented the statistics by noting that eBay runs about 1000 Hadoop nodes, and Yahoo runs about 4000 Hadoop nodes.
Konstantin then expanded on the operating system analogy for Hadoop. Hadoop is like an operating system for distributed computing, where data management is the critical factor. I myself work on data analysis projects where my data center receives enormous amounts of data. My weekly meetings are filled with data management issues. Efficient parallel processing within clusters is essential in so many situations today.
Konstantin talked about one of his current areas of research and development. He's working on a new file system. The Hadoop Distributed File System (HDFS) may be sufficient for today's biggest sites. But, Konstantin notes, HDFS has its limitations: "There's a single point of failure, the master node. And, that master node can ultimately become a bottleneck."
So, considering this, does Konstantin just say "oh, well!"? Yeah, right! This is where Giraffa comes in. Konstantin describes Giraffa as a means of replacing the single HDFS master node with distributed, clustered servers, in effect making clusters of master nodes possible.
Take a look at Giraffa to see what Konstantin's talking about here. It's about fail-safe protected replication of HDFS master nodes. So, today's single master node HDFS systems are sufficient to meet the needs of tiny sites like Yahoo, eBay, and Facebook... With Giraffa, Konstantin's looking ahead to a future that will be quite different from today. Since he's already preparing for that future, I think that when it arrives, we'll be ready too!
Thanks for paving the road ahead in advance, Konstantin! You're truly a visionary whose vision we need going forward!
- Cay Horstmann, Dynamic Types in Scala 2.10;
- Frans Thamura, BantuSekolahku - Support MySchool, a System to do reformation and transparency with social; and
- Manfred Riem, The StateHelper API.
Our current Java.net poll asks "What best describes your current feeling about Gradle?" Voting will be open until Friday, December 21.
Here are our latest Java.net Spotlights.
Reza Rahman - Adopt a Java EE 7 JSR!:
Broad community participation is key to the success of any technology worth its salt. The Adopt-a-JSR program was launched in recognition of this fact. It is an initiative by some key JUG leaders around the world to encourage JUG members to get involved in a JSR and to evangelize that JSR to their JUG and the wider Java community, in order to increase grassroots participation...
Reza Rahman - Happy Birthday Java EE 6+GlassFish 3!:
It has been almost exactly three years since Java EE 6 and GlassFish 3 were announced. It's worth pausing a moment to take stock of what has happened since. Both Java EE 6 and GlassFish 3 have been game changers. EE 6 has brought Java EE back into the limelight. To see evidence of that, look at presentations like these from independents like Bert Ertman and Paul Bakker...
Here are the stories we've recently featured in our Java News section:
- Hinkmond Wong: RPi and Java Embedded GPIO: Big Data and Java Technology;
- Geertjan Wielenga: NetBeans IDE 7.3 Knows Null;
- Heather VanCura: CP.Next - Early Adopters of JCP 2.8;
- Andrew Glover: The Java technical podcast series: The cloud files;
- Markus Eisele: Are your Garbage Collection Logs speaking to you? Censum does!; and
- Tori Wieldt: Java SE Updates.