
R and Streaming from Hadoop in Practice

Posted by manning_pubs on October 10, 2012 at 7:34 AM PDT



R and Streaming

by Alex Holmes, author of Hadoop in Practice

A data scientist needs a way to use R in conjunction with Hadoop, bridging the gap between Hadoop and the huge body of statistical knowledge that exists in R. In this article, based on chapter 8 of Hadoop in Practice, author Alex Holmes shows how you can use R in combination with Hadoop Streaming.

With Hadoop Streaming, you can write Map and Reduce functions in any language that can read data from standard input and write to standard output. In this article, we'll look at how to get Streaming working directly with R in two steps: first in a Map-only job, and then in a full MapReduce job. We'll be working with stock data and performing simple calculations; the goal is to show how R and Hadoop can be integrated via Streaming.
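To make that contract concrete, here is a minimal identity mapper sketch in R (a hedged illustration, not part of the book's source tree): it echoes every line from standard input back to standard output, which is all Streaming requires of a Map script.

#! /usr/bin/env Rscript
# A minimal identity mapper: echoes each stdin line to stdout.
# Streaming feeds records on standard input; anything we print to
# standard output becomes the Map output.
input <- file("stdin", "r")
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
  cat(line, "\n", sep="")
}
close(input)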

Streaming and Map-only R

Just like with regular MapReduce, we can have a Map-only job in Streaming and R. Map-only jobs make sense in situations where we don't need to join or group our data in the Reducer.

Technique: calculate the daily mean for stocks

In this technique, we'll look at how Hadoop Streaming and R can be used on our stock data to calculate the daily means for each stock symbol.

PROBLEM

You want to integrate R and MapReduce.

SOLUTION

In this technique, we're going to work on the stock CSV file that contains the following elements for each stock.

Symbol,Date,Open,High,Low,Close,Volume,Adj Close

A subset of the contents of the file can be viewed below.

$ head -6 test-data/stocks.txt
AAPL,2009-01-02,85.88,91.04,85.16,90.75,26643400,90.75
AAPL,2008-01-02,199.27,200.26,192.55,194.84,38542100,194.84
AAPL,2007-01-03,86.29,86.58,81.90,83.80,44225700,83.80
AAPL,2006-01-03,72.38,74.75,72.25,74.75,28829800,74.75
AAPL,2005-01-03,64.78,65.11,62.60,63.29,24714000,31.65
AAPL,2004-01-02,21.55,21.75,21.18,21.28,5165800,10.64

In our job, we're going to calculate the daily mean for each line using the open and close prices. The R script to perform that task is shown below.

#! /usr/bin/env Rscript                                                 #1
options(warn=-1)                                                        #2
sink("/dev/null")                                                       #3
input <- file("stdin", "r")                                             #4

while(length(currentLine <-                                             #5
         readLines(input, n=1, warn=FALSE)) > 0) {

   fields <- unlist(strsplit(currentLine, ","))                         #6

   openClose <- c(as.double(fields[3]), as.double(fields[6]))           #7

   mean <- mean(openClose)                                              #8

   sink()                                                               #9

   cat(fields[1], fields[2], mean, "\n", sep="\t")                      #10

   sink("/dev/null")                                                    #11                                                                                            
}
close(input)


#1 The shebang identifies Rscript as the interpreter used to execute this script.

#2 Disables warnings so they don't pollute our output.

#3 The sink function controls the destination of R output. Because our code runs inside Hadoop Streaming, we want control over exactly what is written to standard output, so we redirect all R output (such as output generated by third-party functions) to /dev/null until later in our code.

#4 Opens a connection to the process's standard input.

#5 Reads a line from standard input; n is the number of lines that should be read. We set warn to FALSE since we don't receive an EOF when reading from standard input. When readLines returns zero lines, we take that to mean we've hit the end of the input.

#6 Splits the string using the comma as a separator, and flattens the resulting list into a vector.

#7 Creates a vector containing the stock open and close prices in numeric form.

#8 Calculates the mean of the open and close prices.

#9 Calling sink with no arguments restores the output destination so that we can write our data to standard output.

#10 Concatenates the stock symbol, date, and daily mean, separated by tabs, and writes them to standard output.

#11 Redirects all R output to /dev/null.


GitHub source: [src/main/R/ch8] stock_day_avg.R

DISCUSSION

Figure 1 shows how Streaming and R work together in a map-only job.

Figure 1 The R and Streaming Map-only data flow

Any MapReduce code can be challenging to test, but the great thing about Hadoop Streaming code is that it's easy to test on the command line without involving MapReduce at all. The following shows how the Linux cat utility (a simple utility that writes the contents of a file to standard output) can be used to quickly test our R script and make sure the output is what we expect.

$ cat test-data/stocks.txt | src/main/R/ch8/stock_day_avg.R
AAPL 2009-01-02 88.315
AAPL 2008-01-02 197.055
AAPL 2007-01-03 85.045
AAPL 2006-01-03 73.565
...

That output looks good, so I think we're ready to try running this in a Hadoop job.

$ export HADOOP_HOME=/usr/lib/hadoop                                  #1

$ ${HADOOP_HOME}/bin/hadoop fs -rmr output                            #2

$ ${HADOOP_HOME}/bin/hadoop fs -put test-data/stocks.txt \            #3
    stocks.txt

$ ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/*.jar \                        #4
  -D mapreduce.job.reduces=0 \                                        #5
  -inputformat org.apache.hadoop.mapred.TextInputFormat \             #6
  -input stocks.txt \                                                 #7
  -output output \                                                    #8
  -mapper `pwd`/src/main/R/ch8/stock_day_avg.R \                      #9
  -file `pwd`/src/main/R/ch8/stock_day_avg.R                          #10


#1 Sets the location of your Hadoop installation.

#2 Removes the output directory in HDFS. If the directory doesn't exist, this will generate a warning that can be ignored.

#3 Copies the stocks data into HDFS.

#4 Specifies that we want to run the streaming JAR.

#5 We're running a Map-only job, so set the number of reducers to 0.

#6 Specifies the InputFormat for the job.

#7 Identifies the input file for the job.

#8 Sets the output directory for the job.

#9 Tells Streaming which executable should be executed in the Map phase.

#10 Specifies that the R script should be copied into the Distributed Cache and made available to the Map tasks.

You may have noticed that we used TextInputFormat, which emits a key/value tuple where the key is the byte offset in the file and the value is the contents of a line. However, our R script was only supplied the value part of the tuple.

This is an optimization in Hadoop Streaming: when it detects that TextInputFormat is being used, it drops the key and sends only the value to your script. If you want the key supplied to your script as well, set the stream.map.input.ignoreKey Hadoop configuration property to false.
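If you do disable that optimization, each input line arrives as the key, a tab, and then the value, and your script has to peel the key off itself. Here's a hedged sketch of what that parsing could look like in R (assuming stream.map.input.ignoreKey is set to false; this mapper isn't part of the book's source tree):

#! /usr/bin/env Rscript
# Hedged sketch: a mapper run with stream.map.input.ignoreKey=false,
# so every stdin line looks like "<byte offset>\t<line of the file>".
input <- file("stdin", "r")
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
  parts <- unlist(strsplit(line, "\t"))
  offset <- as.numeric(parts[1])                # the TextInputFormat key
  value <- paste(parts[-1], collapse="\t")      # the rest is the value
  cat(offset, value, "\n", sep="\t")            # echo both back out
}
close(input)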

Figure 2 shows some Streaming configuration settings that can be used to customize Map inputs and outputs.

Figure 2 Streaming configurations for Map tasks

Now that we've covered how we can use R and Streaming for a Map-only job, let's see how we can get R working with a full Map and Reduce job.

Streaming, R, and full MapReduce

We'll now look at how we can integrate R with a full-blown MapReduce job. We'll build on what we learned using Streaming with a Map-side R function, and introduce a Reduce-side function. In doing so, we'll see how Hadoop Streaming supplies Map output keys and the list of Map output value tuples to the standard input of the R function, and how the R function's outputs are collected.

Technique: calculate the cumulative moving average for stocks

Our previous technique calculated the daily mean for each stock symbol. Now we're going to use the MapReduce framework to group together all the data for each stock symbol across multiple days, and then calculate a cumulative moving average (CMA) over that data.

PROBLEM

You want to integrate R and Streaming in both the Map and Reduce sides.

SOLUTION

Recall that in our Map-side technique, the Map R script emitted tab-separated output with the following fields.

Symbol Date Mean

MapReduce will sort and group the output of our Map script by its key, which is the stock symbol. For each unique stock symbol, MapReduce will feed our Reduce R script all of the Map output values for that symbol. Our script will average the daily means together and emit a single output containing the CMA.
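Before we get to the script, a quick illustration of the CMA itself may help (this snippet is purely illustrative; the values are the first three AAPL daily means from the Map output we saw earlier). The last element of the running average is the single value our Reduce script will emit:

# Illustrating the cumulative moving average (CMA) in R, using the
# first three AAPL daily means from the Map output shown earlier.
means <- c(88.315, 197.055, 85.045)
cma <- cumsum(means) / seq_along(means)  # running average after each day
print(cma)                               # roughly 88.315 142.685 123.472
mean(means)                              # the final CMA, which our reducer emits

The full Reduce script is shown below.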

#! /usr/bin/env Rscript
options(warn=-1)
sink("/dev/null")

outputMean <- function(stock, means) {                                 #1
  stock_mean <- mean(means)
  sink()
  cat(stock, stock_mean, "\n", sep="\t")
  sink("/dev/null")
}

input <- file("stdin", "r")
prevKey <- ""

means <- numeric(0)

while(length(currentLine <- readLines(input, n=1, warn=FALSE)) > 0) {

  fields <- unlist(strsplit(currentLine, "\t"))

  key <- fields[1]                                                    #2
  mean <- as.double(fields[3])                                        #3

  if (identical(prevKey, "") || identical(prevKey, key)) {
    prevKey <- key
    means <- c(means, mean)
  } else {
    outputMean(prevKey, means)                                        #4
    prevKey <- key
    means <- c(mean)
  }
}

if(!identical(prevKey, "")) {
  outputMean(prevKey, means)
}

close(input)


#1 A simple R function that takes as input the stock symbol and a vector of means. It calculates the CMA and writes the symbol and CMA to standard output.

#2 Reads the key, which is the stock symbol.

#3 Reads the mean from the input.

#4 Once we encounter a new key, we've hit a new Map output key, so it's time to call the function to calculate the CMA for the previous key and write the output to standard output. We then reset the means vector so that it contains only the mean for the new key.

GitHub source: [src/main/R/ch8] stock_cma.R

DISCUSSION

Figure 3 shows how Streaming and our R script work together in the Reduce side.

Figure 3 The R and Streaming MapReduce data flow

Again, the beauty of Streaming is that we can easily test it by piping together Linux commands, with sort standing in for MapReduce's shuffle and sort phase.

$ cat test-data/stocks.txt | src/main/R/ch8/stock_day_avg.R | \
  sort --key 1,1 | src/main/R/ch8/stock_cma.R
AAPL 68.997
CSCO 49.94775
GOOG 123.9468
MSFT 101.297
YHOO 94.55789

That output looks good, so I think we're ready to try running this in a Hadoop job.

$ export HADOOP_HOME=/usr/lib/hadoop

$ ${HADOOP_HOME}/bin/hadoop fs -rmr output

$ ${HADOOP_HOME}/bin/hadoop fs -put test-data/stocks.txt stocks.txt

$ ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/*.jar \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -input stocks.txt \
  -output output \
  -mapper `pwd`/src/main/R/ch8/stock_day_avg.R \                        #1
  -reducer `pwd`/src/main/R/ch8/stock_cma.R \                           #2
  -file `pwd`/src/main/R/ch8/stock_day_avg.R \
  -file `pwd`/src/main/R/ch8/stock_cma.R                                #3


#1 Specifies the map R script (the same script we ran in the previous Map-only technique).

#2 Sets the reduce R script.

#3 Copies both R scripts into the Distributed Cache so that the Map and Reduce tasks can execute them.

We can cat the job output to confirm that it's identical to what we produced when running the R scripts directly.

$ hadoop fs -cat output/part*
AAPL 68.997
CSCO 49.94775
GOOG 123.9468
MSFT 101.297
YHOO 94.55789

Figure 4 shows some Streaming configuration settings that can be used to customize Reduce inputs and outputs.

Figure 4 Streaming configurations for Reduce tasks

What if the Map output values need to be supplied to the Reducer in a specific order for each Map output key (called secondary sort)? Secondary sort in Streaming can be achieved by using the KeyFieldBasedPartitioner, as shown below.

$ export HADOOP_HOME=/usr/lib/hadoop

$ ${HADOOP_HOME}/bin/hadoop fs -rmr output

$ ${HADOOP_HOME}/bin/hadoop fs -put test-data/stocks.txt stocks.txt

$ ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/*.jar \
  -D stream.num.map.output.key.fields=2 \                               #1
  -D mapred.text.key.partitioner.options=-k1,1 \                        #2
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -input stocks.txt \
  -output output \
  -mapper `pwd`/src/main/R/ch8/stock_day_avg.R \
  -reducer `pwd`/src/main/R/ch8/stock_cma.R \
  -partitioner \                                                        #3
      org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -file `pwd`/src/main/R/ch8/stock_day_avg.R \
  -file `pwd`/src/main/R/ch8/stock_cma.R


#1 Specifies that Streaming should consider both the stock symbol and date to be part of the map output key.

#2 Specifies that MapReduce should partition output based on the first token in the Map output key, which is the stock symbol.

#3 Specifies the partitioner for the job, KeyFieldBasedPartitioner, which parses mapred.text.key.partitioner.options to determine which fields to partition on.
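Because the Map output key now contains both the symbol and the date, and ISO-formatted dates sort correctly as plain text, each Reducer will see a symbol's daily means in date order. That ordering makes it possible to emit a running CMA for every day instead of a single final value. Here's a hedged sketch of such a Reduce script (an illustration, not the book's stock_cma.R):

#! /usr/bin/env Rscript
# Hedged sketch: a reducer that assumes its input is secondary-sorted,
# i.e. lines of "symbol\tdate\tmean" arriving in date order per symbol.
# It emits the CMA as of each date rather than one final value.
options(warn=-1)
input <- file("stdin", "r")
prevKey <- ""
total <- 0
count <- 0
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
  fields <- unlist(strsplit(line, "\t"))
  key <- fields[1]
  date <- fields[2]
  dailyMean <- as.double(fields[3])
  if(!identical(prevKey, key)) {      # new symbol: reset the running state
    prevKey <- key
    total <- 0
    count <- 0
  }
  total <- total + dailyMean
  count <- count + 1
  cat(key, date, total / count, "\n", sep="\t")  # the CMA as of this date
}
close(input)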

For additional Streaming features, such as more control over sorting, take a look at the Hadoop Streaming documentation.

Summary

The fusion of R and Hadoop allows for large-scale statistical computation, which becomes all the more compelling as both our data sizes and our analysis needs grow. In this article, we focused on one of three approaches that can be used to combine R and Hadoop: R and Streaming.


Here are some other Manning titles you might be interested in:

Hadoop in Action
Chuck Lam

Mahout in Action
Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman

Tika in Action
Chris Mattmann and Jukka L. Zitting

