JavaOne 2013 Impressions #4: How to Utilize Hadoop to Process 30 Characters in just 34 Seconds! And How to Do Much, Much More...
What happens if you configure a Hadoop-centric, scalable big data infrastructure running on the Amazon EC2 cloud, programmed to perform some very basic processing on delimited ASCII data records (about 30 characters in each record), and you feed it a single data record? How quickly do you think that massively powerful system would process that teensy, trivial input data set? Milliseconds? Microseconds? Nah, surely that type of system would blast through a single record in nanoseconds, right? Or even picoseconds or femtoseconds or attoseconds? In zeptoseconds? Yoctoseconds? (Anyone know what comes next?)
The answer, as it turns out, is 34 seconds. I found this out in the Wednesday JavaOne session Pragmatic Big Data Architectures in the Cloud: A Developer’s Perspective [CON5657]. The session was presented by Fabiane Nardon, Chief Scientist at Tail Target, and Fernando Barbadopulos, Tail Target's CTO.
34 seconds to process 30 characters using Hadoop on an Amazon EC2 cluster? Indeed...
The problem with feeding a Big Data infrastructure tiny data sets is that such an infrastructure is tuned for processing Big Data. When data of any size is fed in, a complex chain of events (all tuned in preparation for processing massive quantities of incoming data) is initiated. In the particular case of Fabiane's example, that initialization process takes about 34 seconds. Hence, processing even a single 30-character record takes 34 seconds.
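The arithmetic behind this is worth making explicit. A minimal sketch of the cost model, in Java: job latency is a fixed setup cost plus a per-record cost, so setup completely dominates tiny inputs. The 34-second setup figure is from the talk; the per-record throughput below is my own illustrative assumption, not a number Fabiane quoted.

```java
// Toy cost model for Hadoop job latency on a pre-configured cluster.
// SETUP_SECONDS comes from the talk; SECONDS_PER_MILLION_RECORDS is an
// assumed, illustrative throughput figure.
public class JobLatency {
    static final double SETUP_SECONDS = 34.0;              // fixed job/cluster initialization
    static final double SECONDS_PER_MILLION_RECORDS = 5.0; // assumed steady-state throughput

    static double latencySeconds(long records) {
        return SETUP_SECONDS + (records / 1_000_000.0) * SECONDS_PER_MILLION_RECORDS;
    }

    public static void main(String[] args) {
        // One 30-character record: the setup cost IS the latency.
        System.out.printf("1 record:   %.1f s%n", latencySeconds(1));
        // Six billion records (a month of Tail Target data): setup is noise.
        System.out.printf("6B records: %.1f s%n", latencySeconds(6_000_000_000L));
    }
}
```

Under this model, the 34 seconds is a flat entry fee: irrelevant when amortized over billions of records, absurd when paid for one.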
The presentation opened and closed with the question: how big is BIG? Clearly, if you're going to implement a Big Data infrastructure to solve a problem, you need to feed that infrastructure a genuinely big data set. So, how big is BIG? Fabiane provided some metrics. For example, if you're not analyzing terabytes of data, that's probably not a Big Data analysis; if your data is not doubling in size annually, you're likely not addressing a Big Data problem.
Tail Target addresses genuine Big Data problems. They are currently working with a data set where:
- 6 billion new data records are created each month;
- data constantly flows into their system from 20,000 simultaneous internet connections;
- 150 million people are mapped into the system (almost the entire population of Brazil).
Fabiane went into some of the structural details of a Hadoop-based Big Data infrastructure, and highlighted some of the problems and performance issues the Tail Target team encountered as it developed a solution for the problem at hand.
At one point during her presentation, Fabiane paused and said: "It's so hard to do presentations today, because after 15 minutes everyone starts tweeting, checking email..." Then she implored the audience to "stay here!"
I don't know... I certainly didn't have any problem staying interested in this particular session!
Fabiane then continued her discussion on strategies for improving processing efficiency in Big Data applications.
After a while, Fernando took the stage to talk about strategies for achieving financial efficiency when utilizing the Amazon Elastic Compute Cloud (EC2).
The details of both the programming and financial optimization strategies will have to appear in a sequel to this post. I'll close, though, with a statement in one of the first slides that really reverberated for me. The slide was kind of an equation:
Big Data + Cloud => Disruptive Apps
That wasn't the exact geometry of the slide, but it made me think. Big Data is big because it's today's reality. Big Data is a problem for some, an opportunity for others (those who can mine the data for valuable information). Therefore, Big Data in itself is a disruptive reality.
The Cloud surely is disruptive in its own right (go talk to the companies that are losing business to newer cloud-based platforms if you don't believe me). But the Cloud, as Fernando convincingly illustrated, provides an incredible opportunity for start-ups that need significant, scalable processing at a cost well below that of setting up their own data center. The possibility of a new company -- even a single-person company -- having at its fingertips immediately scalable, powerful, highly reliable processing resources, purchased on an as-needed basis, is disruptive.
Put the two together: Big Data + Cloud, add in a developer team (even a team of one), and you've got a potential for disruption that's akin to Henry Ford inventing the auto assembly line. Except, in this case, the workers are all those Amazon EC2 processors, and the assembly line is the MapReduce pipeline running on the Hadoop cluster.
Times, and technology, change -- but don't they also remain the same? Given the freedom to create, and access to the necessary tools, creative individuals and entrepreneurs will innovate. That's what's happening at Tail Target, anyway...