Pig from a Bird's-Eye View
by M. Tim Jones, author of Pig in Action
Today, we are being inundated with data. IBM estimated that in 2012, 2.5 quintillion bytes of information were generated every day. That's 2.5 million trillion bytes of data, more than we can fathom and considerably more than we can easily process. Online, every web page we visit, every link we click, every phrase we search, every photo or video we upload, every "like," every purchase we make, and every comment we write generates some amount of data, along with associated metadata (data about that data, such as the date and time). Our online footprints generate not only large quantities of data but also the potential for insight, since this data can lead to real value if processed appropriately. This modern data deluge has been called Big Data, and it's only getting bigger. In this article based on chapter 1 of Pig in Action, author Tim Jones talks about how Pig democratizes Big Data.
Big data refers to data that requires new architectures and methods to process within a reasonable amount of time. As Moore's Law begins to slow, instead of building increasingly faster machines (called scale-up), we build architectures that scale out by adding compute and storage capacity. This approach allows an architecture to scale capacity to process data in parallel, both in terms of compute and storage. Hadoop is such a platform and is the de facto standard platform for large-scale data processing.
Understand Moore's Law
Gordon Moore, later a co-founder of Intel, introduced his law in a 1965 paper in which he predicted that the number of transistors in the same physical area of an integrated circuit would double every year; in 1975 he revised the rate to a doubling every two years. This doubling of density drives the complexity and performance of those integrated circuits, but Moore's Law is expected to reach its physical limits in the coming decade.
Hadoop is a scalable, distributed platform that solves what are known as embarrassingly parallel problems: problems that can be split up and processed in parallel with little or no communication between the parallel tasks. This makes them easy to process on clusters of servers built with cost-efficient Ethernet (and SATA SSDs for higher performance). Examples of embarrassingly parallel problems include search; fractal calculations such as the Mandelbrot set, in which each point can be calculated independently; and genetic algorithms, which evaluate potential solutions to a given problem in parallel.
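To make "embarrassingly parallel" concrete, here is a minimal sketch in plain Python (not Pig or Hadoop) of the Mandelbrot example. The function names, grid size, and iteration limit are assumptions chosen for illustration. The point is that each sample is computed without looking at any other sample, so a pool of workers can share the work without communicating.

```python
from concurrent.futures import ThreadPoolExecutor


def escape_time(c: complex, max_iter: int = 50) -> int:
    """Return the iteration at which z = z*z + c exceeds |z| > 2, or max_iter."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter


def make_grid(width: int, height: int) -> list[complex]:
    """Sample the region [-2, 1] x [-1, 1] of the complex plane."""
    return [
        complex(-2.0 + 3.0 * x / (width - 1), -1.0 + 2.0 * y / (height - 1))
        for y in range(height)
        for x in range(width)
    ]


points = make_grid(30, 15)

# Sequential: one point at a time.
sequential = [escape_time(c) for c in points]

# Parallel: the same function mapped over the same points by a worker pool.
# No worker needs a result computed by another worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(escape_time, points))

# Splitting the work changes nothing about the answer.
assert parallel == sequential
```

Threads are used here only to keep the sketch short; in standard CPython they will not speed up CPU-bound work. On a cluster, each worker would be a separate machine handling its own slice of the points.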
We begin our discussion of Pig by going through what Pig can and can't do.
What Pig can and can't do
While Pig is very versatile and extensible, there are certain tasks that are better performed elsewhere (some of these limitations are specific to the underlying data processing platform, Hadoop). The most obvious limitation is that Pig and Hadoop are best applied to batch processing of large datasets. Because of the latencies involved in Pig and Hadoop processing, they are not ideal for real-time or near-real-time processing. But processing large, stable datasets such as meteorological trends or web-server logs is a snap for Pig. In particular, Pig is perfect for data processing that involves a number of steps (a processing pipeline). Such pipelines tend to be more difficult in MapReduce, where each job is expressed as a single map step and a single reduce step, so a multi-step pipeline has to be built by chaining jobs together.
Pig provides features to more easily implement complex aspects of MapReduce programming. Features like grouping and aggregation, sorting, and SQL-like joins are made easy with Pig, but can consume a considerable amount of development time when implemented in low-level MapReduce.
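As a rough illustration of what grouping, aggregation, joining, and sorting involve, here is a small plain-Python sketch (not Pig, and not MapReduce). The sample records and variable names are made up for the example. Pig expresses each of these steps with an operator such as GROUP, JOIN, or ORDER; the sketch only shows the logic those operators stand for.

```python
from collections import defaultdict

# Hypothetical sample data: (customer_id, amount) and (customer_id, name).
purchases = [(1, 20.0), (2, 5.0), (1, 7.5), (3, 12.0), (2, 30.0)]
customers = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]

# Group: collect the amounts for each customer_id.
groups = defaultdict(list)
for customer_id, amount in purchases:
    groups[customer_id].append(amount)

# Aggregate: one total per group.
totals = {customer_id: sum(amounts) for customer_id, amounts in groups.items()}

# Join: attach the customer name by matching on customer_id.
names = dict(customers)
joined = [(names[cid], total) for cid, total in totals.items() if cid in names]

# Sort: largest total first.
ranked = sorted(joined, key=lambda row: row[1], reverse=True)

for name, total in ranked:
    print(f"{name}\t{total:.2f}")
```

On a single machine this is a few lines. The development cost the text refers to comes from doing the same thing across a cluster, where records with the same key must first be brought together.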
The ideal problems for Pig and Hadoop are those that can be carved up, analyzed in pieces in parallel, and then put back together to arrive at a result. Aggregating customer purchases is an ideal example, since the sums can be computed in parallel. Analyzing relationships in the data can be difficult in Hadoop because once the data is broken up for parallel processing, each node lacks the full picture that only the entire dataset provides.
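The carve-up, process, and recombine pattern can be sketched in a few lines of plain Python. This is an illustration only: the purchase records, the function names, and the way the data is split are all assumptions, and a real Hadoop job would distribute the chunks across nodes rather than loop over them.

```python
from collections import Counter

# Hypothetical purchase records: (customer, amount in cents).
purchases = [
    ("alice", 1250), ("bob", 400), ("alice", 300),
    ("carol", 999), ("bob", 2100), ("carol", 1),
]


def split(records, parts):
    """Carve the records into `parts` roughly equal chunks."""
    return [records[i::parts] for i in range(parts)]


def partial_totals(chunk):
    """Sum purchases per customer using only the records in one chunk."""
    totals = Counter()
    for customer, cents in chunk:
        totals[customer] += cents
    return totals


def combine(partials):
    """Merge the per-chunk totals into one overall result."""
    merged = Counter()
    for totals in partials:
        merged.update(totals)  # Counter.update adds counts together
    return merged


chunks = split(purchases, 3)
result = combine(partial_totals(chunk) for chunk in chunks)
print(dict(result))
```

Because addition does not care about order or grouping, the result is the same no matter how the records are split. That property is what lets each chunk be summed on a different node.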
Not all problems can be solved in this way, as some datasets exhibit dependencies and cannot be processed in separate pieces. One example is problems that rely on recursion (for example, the Fibonacci sequence, in which each new value depends on prior values). Since MapReduce breaks a problem into many smaller subproblems to be processed independently, the dependence is lost. Luckily, many problems can be solved in this way, including text mining, collaborative filtering (machine learning), sentiment analysis, recommendation, effective ad targeting, pattern recognition, building prediction models, and building indexes for search.
But one of the most interesting and useful meta-problems solved with Pig is experimentation (data sandboxing). Pig makes it simple to build scripts that analyze data, so you can try different approaches and identify the best one. Ultimately, this is what makes Pig a technology worth understanding and an increasingly popular tool in data science.
The ecosystem building up around Hadoop is evidence that it has become the de facto standard big data processing system. One of the most active areas of development is tooling that helps you build data processing applications without writing native MapReduce.
One of the biggest features differentiating Pig from other models is its ability to manipulate datasets interactively. You can develop a script by testing it interactively on data, which speeds development because issues are found early. Most other solutions, like MapReduce, require you to run your script in batch mode, so you see the result only after the entire script or application has executed. Pig's interactivity also enables experimentation on datasets and ad hoc processing of data.
Some of the interesting competitors of Pig include Crunch, Cascading, and Hive. Apache Hive, originally developed by Facebook, is a data warehouse application built on top of Hadoop (see figure 1). It provides query and data analysis services using a SQL-like language called HiveQL. HiveQL is similar to the Structured Query Language (SQL, popular in relational database systems), but does not support the entire SQL specification (nor has Hive attempted to achieve this). HiveQL provides a useful query layer for the analysis of datasets. As with Pig scripts, HiveQL queries are translated into MapReduce pipelines, easing the use of Hadoop for data processing.
Figure 1 Apache Hive Stack on Hadoop.
Hive is the closest competitor to Pig in its ability to compile SQL-like abstractions to MapReduce applications. The disadvantage of Hive is the additional requirements it imposes on a Hadoop cluster. Additionally, Hive focuses on structured data (similar to database systems), whereas Pig can process both structured and unstructured data with ease.
As they satisfy slightly different goals, you'll find large Internet properties (like Yahoo!) use both Hive and Pig based upon the particular need.
The use of Pig
Not surprisingly, Pig has very wide and diverse use across the big-data industry. Yahoo! recently reported that almost half of the MapReduce jobs that execute in its large Hadoop cluster were a product of Pig (in support of web search and advertising systems). So instead of being hand-coded as raw MapReduce, almost half of the jobs at Yahoo! originated from Pig scripts through the Pig compiler. This certainly demonstrates the value of Pig within a large production environment.
Many other companies make use of Pig in their production environments. Twitter uses Pig extensively to mine tweets and process usage logs (a common use model). LinkedIn, a professional networking site, uses Pig to mine data to identify and suggest people that you may know and want to connect with (among other mining tasks). The WhitePages website uses Pig to cleanse and filter its multi-billion-record dataset, in addition to analyzing web logs to identify performance indicators.
Among the major Pig users, most applications are in web-based analytics with machine learning and massive log processing. Many users have found the benefit of using Pig for ad hoc exploration of large datasets, as Pig makes it easy to "play" with data. A growing use model is the use of Pig to process unstructured datasets (such as audio or video data).
While developing applications for Hadoop can be complicated, and restricted to those with a computer science background, Pig scripts can be developed by anyone with a desire to process large datasets. Using Hadoop's scalable cluster architecture, and Pig's ability to generate efficient Map and Reduce data pipelines, no data is too big or complex to be reduced into useful insights. And while Pig isn't the only available solution, it's the best scripting language for interactively manipulating datasets using a data pipeline model that's easy to visualize.
Your first peek at Pig
To give you a first taste of Pig, the following short script loads a comma-delimited dataset (earthquakes over the last seven days from the U.S. Geological Survey), filters the dataset to emit only those records that match a given criterion (an earthquake magnitude greater than 6.0, where the magnitude is the 9th field of each tuple, written $8 because Pig numbers fields from zero), and then dumps the result to the screen.
grunt> rawdata = LOAD 'eqs7day-M1.txt' USING PigStorage(',');
grunt> big_ones = FILTER rawdata BY $8 > 6.0;
grunt> DUMP big_ones;
(us,b000gg01,9,"Friday, April 26, 2013 06:53:28 UTC",-28.7357,-178.9155,6.2,349.00,112,"Kermadec Islands region")
(us,b000gen8,A,"Tuesday, April 23, 2013 23:14:42 UTC",-3.9108,152.1266,6.5,16.30,100,"New Ireland region, Papua New Guinea")
This simple script has a very specific flow: load a dataset, reduce it, and provide a result. This is a general pattern in Pig, and it differs greatly from the MapReduce paradigm.
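For readers who don't yet know Pig, the same load, filter, and dump flow can be mimicked in plain Python. This is a sketch under assumptions rather than a description of how Pig works internally. It assumes the loader splits on every comma, including the commas inside the quoted date, which is consistent with the magnitude being the 9th field ($8). The first two sample lines are taken from the output above; the third is made up to show a record being filtered out.

```python
# Two records from the output above plus one made-up low-magnitude record.
SAMPLE = [
    'us,b000gg01,9,"Friday, April 26, 2013 06:53:28 UTC",-28.7357,-178.9155,6.2,349.00,112,"Kermadec Islands region"',
    'us,b000gen8,A,"Tuesday, April 23, 2013 23:14:42 UTC",-3.9108,152.1266,6.5,16.30,100,"New Ireland region, Papua New Guinea"',
    'xx,00000000,1,"Monday, April 22, 2013 00:00:00 UTC",10.0000,20.0000,2.1,5.00,10,"Hypothetical region"',
]


def load(lines, delimiter=","):
    """Split each line on every delimiter, ignoring quotes (a naive split)."""
    return [tuple(line.rstrip("\n").split(delimiter)) for line in lines]


def magnitude_above(record, threshold, index=8):
    """True when the field at `index` parses as a number above `threshold`."""
    try:
        return float(record[index]) > threshold
    except (IndexError, ValueError):
        # Short rows or non-numeric fields (such as a header) are dropped.
        return False


rawdata = load(SAMPLE)                                        # LOAD
big_ones = [r for r in rawdata if magnitude_above(r, 6.0)]    # FILTER
for record in big_ones:                                       # DUMP
    print("(" + ",".join(record) + ")")
```

Each Python statement here lines up with one line of the Pig script, which is the appeal of the data-flow style: the script reads as a sequence of transformations on a named dataset.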
Pig is a solution to the problem of processing large datasets. Pig sits on top of Apache Hadoop, a distributed platform for data-intensive applications. While Hadoop can process big data using its native MapReduce paradigm, Pig makes data processing simpler. Pig allows you to think about problems in terms of data flows and transformations, whereas MapReduce requires a specific mindset for developing applications with map and reduce functions.