# OSCON Wednesday

Posted by haroldcarr on July 27, 2011 at 8:00 PM PDT

# My OSCON Wednesday, 07/27/2011

## 1 10:40am - Essential Data Analysis Workshop

attendance: 32

### 1.1 Part 1

#### 1.1.1 Univariate distributions

• one variable - a set of points - each measures the same quantity
• plot: quantity/Y of points at certain value/X
• critical: determine shape of point distribution
• that tells what methods are applicable
##### 1.1.1.1 histograms and kernel density estimates
• histogram
• chop range of values/X into partitions of fixed size - then Y/# in bin
• choose bin width, not smooth, not unique (must choose bin anchoring)
• problem with histogram: shape of distribution should not depend on an arbitrary parameter
• kernel density estimates (KDE)
• modern alternative to histograms
• localized, variable width, normalized (area under each curve the same), smooth, unique, must choose bandwidth
• gaussian function to adjust bandwidth
• large bandwidth: smooth but little detail
• small bandwidth: noisy, but more faithful to data
• IQR …
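
A minimal sketch of both estimators, to make the bin-anchoring and bandwidth points concrete. The function names (`gaussianKernel`, `kde`, `histogram`) and the sample values are made up for illustration; this is not code from the workshop.

```javascript
// Gaussian kernel: a smooth bump with unit area.
function gaussianKernel(u) {
  return Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
}

// Kernel density estimate at position x: place one kernel on every data
// point and average them. `bandwidth` sets the width of each bump.
function kde(data, bandwidth, x) {
  let sum = 0;
  for (const xi of data) {
    sum += gaussianKernel((x - xi) / bandwidth);
  }
  return sum / (data.length * bandwidth);
}

// Fixed-width histogram for comparison; `anchor` is the left edge of bin 0.
function histogram(data, binWidth, anchor) {
  const counts = new Map();
  for (const xi of data) {
    const bin = Math.floor((xi - anchor) / binWidth);
    counts.set(bin, (counts.get(bin) || 0) + 1);
  }
  return counts;
}

const kdeSample = [1.0, 1.2, 1.9, 2.1, 2.2, 3.8]; // made-up values
const smoothEstimate = kde(kdeSample, 1.0, 2.0);   // large bandwidth: smooth
const detailedEstimate = kde(kdeSample, 0.1, 2.0); // small bandwidth: noisy
const anchoredAtZero = histogram(kdeSample, 1, 0);
const anchoredAtHalf = histogram(kdeSample, 1, 0.5); // same data, different bins
```

Shifting the anchor from 0 to 0.5 regroups the same six points into different bins, which is the arbitrariness the notes complain about. The KDE has no anchor, only the bandwidth.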
##### 1.1.1.3 outliers and outlier detection
• a point that is different from the other points
• far away from typical value
• how to measure far? - via distribution of points themselves
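
One common way to measure "far" from the points themselves is the quartile-fence rule: anything more than 1.5 interquartile ranges outside the quartiles is flagged. The notes do not say which rule the workshop used, so the sketch below is an assumption, and the names `quantileSorted` and `findOutliers` are illustrative.

```javascript
// Linear-interpolation quantile of an already sorted array, q in [0, 1].
function quantileSorted(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual default.
function findOutliers(data, k = 1.5) {
  const sorted = [...data].sort((a, b) => a - b);
  const q1 = quantileSorted(sorted, 0.25);
  const q3 = quantileSorted(sorted, 0.75);
  const iqr = q3 - q1;
  return data.filter((x) => x < q1 - k * iqr || x > q3 + k * iqr);
}

const outlierReadings = [10, 11, 12, 12, 13, 13, 14, 15, 42]; // made-up values
const flaggedReadings = findOutliers(outlierReadings);
```

The median and quartiles are used instead of the mean and standard deviation because the outliers themselves distort the latter.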
##### 1.1.1.4 power-law distribution
• a distribution that has no shape
• use logarithmic graph - makes info visible
• power law: 1/x^1.87
• example: earthquakes, length of rivers, stock market, …
• properties: no location nor scale; standard statistical methods do not work; occur in real world; use double-logarithmic plots to identify
• there is no good understanding of how they occur nor how to handle them
• more points do not mean better accuracy; average never settles
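
On a double-logarithmic plot a power law shows up as a straight line whose slope is the exponent. A minimal sketch of that check, assuming the data are already given as (x, y) pairs; the name `logLogSlope` and the synthetic data are made up for illustration.

```javascript
// Least-squares slope of log(y) against log(x).
// For y = c * x^(-a) the slope is -a.
function logLogSlope(xs, ys) {
  const lx = xs.map((x) => Math.log(x));
  const ly = ys.map((y) => Math.log(y));
  const n = lx.length;
  const meanX = lx.reduce((a, b) => a + b, 0) / n;
  const meanY = ly.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (lx[i] - meanX) * (ly[i] - meanY);
    den += (lx[i] - meanX) ** 2;
  }
  return num / den;
}

// Synthetic data generated exactly from 1/x^1.87, so the fit is trivial.
const powerLawXs = [1, 2, 4, 8, 16, 32];
const powerLawYs = powerLawXs.map((x) => Math.pow(x, -1.87));
const fittedSlope = logLogSlope(powerLawXs, powerLawYs);
```

Real data are noisy and usually need binning first, so a straight-line fit like this is only a first look, not a reliable estimator of the exponent.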
##### 1.1.1.5 cumulative distribution functions
• Alternative to density estimate but has different use
• cumulative shows fraction of points to left of current location : always increases; always smoother; useful for comparing;
• intuitive idea of shape of function; cumulative gives quantitative idea
• properties: monotonic; unique; smoother than histogram/density; normalized to 1; use for comparison of two distributions; use to estimate tail weights
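
A minimal sketch of an empirical cumulative distribution function and the two uses listed above (tail weights and comparing distributions). The names `makeEcdf` and `maxCdfGap` are illustrative; taking the largest vertical gap is one simple way to compare, not necessarily the one shown in the workshop.

```javascript
// Returns a function giving the fraction of points less than or equal to x.
function makeEcdf(data) {
  const sorted = [...data].sort((a, b) => a - b);
  return function ecdf(x) {
    // Binary search for the number of values <= x.
    let lo = 0;
    let hi = sorted.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (sorted[mid] <= x) lo = mid + 1;
      else hi = mid;
    }
    return lo / sorted.length;
  };
}

// Largest vertical gap between the two cumulative curves.
function maxCdfGap(a, b) {
  const fa = makeEcdf(a);
  const fb = makeEcdf(b);
  let gap = 0;
  for (const x of [...a, ...b]) {
    gap = Math.max(gap, Math.abs(fa(x) - fb(x)));
  }
  return gap;
}

const ecdfOfSample = makeEcdf([3, 1, 2, 4]); // made-up values
const tailWeightAbove2 = 1 - ecdfOfSample(2); // fraction of points above 2
```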

skipped

#### 1.1.2 Resampling Methods

Ex: two suppliers provide some widget. How good are the suppliers? With a sample size of 100, one finds defect rates of 10% and 12% respectively. But that is just a point estimate, which is not good enough. If we repeat the measurement, will we always find the same defect rate? But we probably can't repeat it (e.g., because of cost).

##### 1.1.2.1 Bootstrap
• treat sample as representation of the system
• generate bootstrap samples by sampling with replacement from the original sample
• calculate desired quantity for each bootstrap sample
• use the scatter of values to develop confidence interval for desired quantity
• summary: synthetic samples give confidence; original sample needs to be reasonably large and clean
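
The steps above can be sketched for the supplier example as follows. This is a minimal percentile-bootstrap sketch under stated assumptions: defects are encoded as 1 and good widgets as 0, and the names `makeRng`, `bootstrapInterval` and `defectRate` are made up. `makeRng` is a small seeded generator included only so runs are reproducible; `Math.random` works as well.

```javascript
// Small seeded linear congruential generator, for reproducible runs only.
function makeRng(seed) {
  let state = seed >>> 0;
  return function () {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Percentile bootstrap: resample with replacement, recompute the statistic,
// and read the interval off the sorted values.
function bootstrapInterval(sample, statistic, options = {}) {
  const { resamples = 2000, level = 0.95, rng = Math.random } = options;
  const n = sample.length;
  const stats = [];
  for (let r = 0; r < resamples; r++) {
    const resample = new Array(n);
    for (let i = 0; i < n; i++) {
      resample[i] = sample[Math.floor(rng() * n)];
    }
    stats.push(statistic(resample));
  }
  stats.sort((a, b) => a - b);
  const alpha = (1 - level) / 2;
  return {
    lower: stats[Math.floor(alpha * (resamples - 1))],
    upper: stats[Math.ceil((1 - alpha) * (resamples - 1))],
  };
}

const defectRate = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
// 100 widgets per supplier: 10 and 12 defects respectively.
const supplierA = Array.from({ length: 100 }, (_, i) => (i < 10 ? 1 : 0));
const supplierB = Array.from({ length: 100 }, (_, i) => (i < 12 ? 1 : 0));
const intervalA = bootstrapInterval(supplierA, defectRate, { rng: makeRng(1) });
const intervalB = bootstrapInterval(supplierB, defectRate, { rng: makeRng(2) });
```

If the two intervals overlap heavily, the 10% versus 12% difference in the point estimates says little about which supplier is better.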
##### 1.1.2.2 variants
• variant: Jackknife: "leave-one-out" resampling
• variant: parametric bootstrap: for a kernel density estimate, generate bootstrap sample from that (rather than from the original sample directly)
• Summary of resampling methods: turn point estimates into interval estimates; work if sample is decent; do not rely on assumptions or analytic arguments; can be applied to non-parametric problems
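
A minimal sketch of the jackknife variant, using the standard leave-one-out standard-error formula. The name `jackknife` and the sample values are illustrative.

```javascript
// Leave-one-out resampling: recompute the statistic n times, each time
// dropping one observation, then turn the scatter into a standard error.
function jackknife(sample, statistic) {
  const n = sample.length;
  const estimates = sample.map((_, i) =>
    statistic(sample.filter((_, j) => j !== i))
  );
  const mean = estimates.reduce((a, b) => a + b, 0) / n;
  const sumSquares = estimates.reduce((acc, v) => acc + (v - mean) ** 2, 0);
  return {
    estimates,
    standardError: Math.sqrt(((n - 1) / n) * sumSquares),
  };
}

const sampleMean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
const jackknifeResult = jackknife([2, 4, 4, 4, 5, 5, 7, 9], sampleMean);
```

For the mean, the jackknife standard error coincides with the familiar s/sqrt(n), which makes it a convenient sanity check before applying it to a statistic with no textbook formula.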

### 1.2 Part 2

#### 1.2.1 Time Series and Time-Series Analysis

• Ex: airline passengers per month: seasonality, trending up
• Ex: Nikkei Index: trend up until 1990 then very different
• Ex: calls per day to call center: no trend, but areas and outliers; weekly periodicity
##### 1.2.1.1 Concepts
• components: trend; seasonality; noise; other (e.g., outliers)
• tasks: description (past); forecasting (future); control (present)
• requirements: equal-size increment; no missing points; enough data points
##### 1.2.1.2 Moving ("floating") Averages
• task: smoothing to get rid of noise
• solution: given a smoothing interval, replace the center point with an unweighted average of adjacent points; the larger the smoothing interval, the smoother the result
• can lead to unrealistic artifacts (e.g., outliers cause jumps in average)
• weighted: give greater weight to points near the center of the interval
• problems with moving averages: interior points only; no forecasting; computationally awkward (entire series must be available)
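
A minimal sketch of a centered moving average, unweighted by default and weighted when `weights` is passed. The name `movingAverage` and the example series are made up.

```javascript
// Centered moving average over an odd-sized window.
// Only interior points get a value, so the result is shorter than the input.
function movingAverage(series, window, weights) {
  if (window % 2 === 0) throw new Error("window must be odd");
  const half = (window - 1) / 2;
  const w = weights || new Array(window).fill(1);
  const weightSum = w.reduce((a, b) => a + b, 0);
  const out = [];
  for (let i = half; i < series.length - half; i++) {
    let acc = 0;
    for (let k = -half; k <= half; k++) {
      acc += w[k + half] * series[i + k];
    }
    out.push(acc / weightSum);
  }
  return out;
}

const spiky = [1, 1, 10, 1, 1]; // one outlier in the middle
const unweighted = movingAverage(spiky, 3);          // [4, 4, 4]
const weighted = movingAverage(spiky, 3, [1, 2, 1]); // [3.25, 5.5, 3.25]
```

The unweighted result shows the artifact mentioned above: the single outlier lifts the average by the same amount for as long as it is inside the window, then drops out abruptly. The weighted version softens that edge.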
##### 1.2.1.3 Exponential Smoothing (Holt-Winters Methods)
• previous smoothed value is updated based on new data
• smoothed value is mixture of new "raw" data and previous "smooth" value
• three forms
• single : for data without trend
• double : for data with trend
• triple : data with trend and periodicity
• summary: simple; versatile; few assumptions; updating scheme easy to implement; use mixing parameter to balance smoothness versus faithfulness; single/double/triple depending on data set
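
A minimal sketch of the single and double forms, using the textbook update equations. The initialisation (first value as the starting level, first difference as the starting trend) is one common choice and an assumption here; the function names are illustrative. The triple form adds a seasonal term and is omitted.

```javascript
// Single exponential smoothing: each smoothed value mixes the new raw
// value (weight alpha) with the previous smoothed value (weight 1 - alpha).
function singleExponential(series, alpha) {
  const out = [series[0]];
  for (let t = 1; t < series.length; t++) {
    out.push(alpha * series[t] + (1 - alpha) * out[t - 1]);
  }
  return out;
}

// Double exponential smoothing: tracks a level and a trend, so it can
// follow data with a slope and extrapolate it.
function doubleExponential(series, alpha, beta) {
  let level = series[0];
  let trend = series[1] - series[0];
  const smoothed = [level];
  for (let t = 1; t < series.length; t++) {
    const previousLevel = level;
    level = alpha * series[t] + (1 - alpha) * (previousLevel + trend);
    trend = beta * (level - previousLevel) + (1 - beta) * trend;
    smoothed.push(level);
  }
  // forecast(h) extrapolates h steps past the last observation.
  return { smoothed, forecast: (h) => level + h * trend };
}

const smoothedSingle = singleExponential([10, 20, 30], 0.5); // [10, 15, 22.5]
const fitDouble = doubleExponential([1, 2, 3, 4, 5], 0.5, 0.5);
const nextValue = fitDouble.forecast(1); // 6 for this perfectly linear series
```

Unlike a moving average, each update needs only the newest point and the previous state, which is why the scheme works up to the edge of the data and supports forecasting.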
##### 1.2.1.4 Outlier Detection in Time Series
• method
• 1: smooth the data to establish a baseline ("location")
• 2: find residual between baseline and data (subtract smooth value from data)
• 3: determine distribution (via kernel density estimate) of residual to find the "width"
• 4: use the cumulative distribution to establish "confidence bands"
• 5: outliers differ from the baseline by more than the "width"
• alternative: use Holt-Winters smoothing (smooth the residual)
• advantage over outlier detection: works "up to the edge" (not just for interior points); adapts if the width of the residual changes
• special cases:
• if baseline is known and fixed then simpler (e.g., control problems like heated room with fixed temperature setting)
• if data is highly seasonal use "last year's value" as baseline
• summary
• visual inspection
• smoothing to establish baseline
• calculate residual
• distribution of residuals: "width"
• cumulative distribution of residuals: "confidence bands"
• smooth baseline to find outliers
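
The method can be sketched as follows. This is a simplified sketch, not the workshop's code: the baseline is a centered unweighted moving average, and the "width" is read directly off the empirical cumulative distribution of the absolute residuals rather than from a kernel density estimate. The name `detectOutliers`, its defaults and the example series are assumptions.

```javascript
// 1. smooth to get a baseline, 2. take residuals, 3./4. read the band
// width off the cumulative distribution of |residual|, 5. flag points
// whose residual exceeds that width.
function detectOutliers(series, window = 3, level = 0.9) {
  const half = (window - 1) / 2;
  const points = [];
  for (let i = half; i < series.length - half; i++) {
    let sum = 0;
    for (let k = -half; k <= half; k++) sum += series[i + k];
    const baseline = sum / window;
    points.push({ index: i, baseline, residual: series[i] - baseline });
  }
  const sortedAbs = points
    .map((p) => Math.abs(p.residual))
    .sort((a, b) => a - b);
  // Nearest-rank quantile: `level` of the residuals lie within `width`.
  const rank = Math.ceil(level * sortedAbs.length) - 1;
  const width = sortedAbs[rank];
  const outliers = points
    .filter((p) => Math.abs(p.residual) > width)
    .map((p) => p.index);
  return { width, outliers };
}

// Made-up series with one spike at index 6.
const callsPerDay = [10, 11, 10, 9, 10, 11, 30, 10, 9, 10, 11, 10, 9];
const detection = detectOutliers(callsPerDay);
```

Two caveats of this simplification: the spike also drags the moving-average baseline of its neighbours, and a quantile-based band flags roughly a fraction `1 - level` of the interior points by construction, so `level` has to be chosen with the expected outlier rate in mind.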
##### 1.2.1.5 Trend Detection in Time Series
• Ex: fluctuations around a known setpoint
• use: cusum chart: makes small change in setpoint visible
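
A minimal sketch of the quantity a cusum chart plots: the running sum of deviations from the setpoint. The name `cusum` and the readings are made up for illustration.

```javascript
// Running sum of (value - setpoint). While the process sits on the
// setpoint the sum hovers around zero; a small persistent shift turns
// into a steadily rising or falling line.
function cusum(series, setpoint) {
  let total = 0;
  return series.map((x) => {
    total += x - setpoint;
    return total;
  });
}

// Setpoint 10; the level shifts by only 0.5 from the fourth reading on.
const cusumReadings = [10, 10, 10, 10.5, 10.5, 10.5];
const cusumLine = cusum(cusumReadings, 10); // [0, 0, 0, 0.5, 1, 1.5]
```

A shift of 0.5 is easy to miss in the raw readings once noise is added, but it accumulates into a visible slope in the cumulative sum.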

#### 1.2.2 Modeling

• the purpose of data analysis is to understand the system, not the data
• data is means to an end
• helps to formulate ideas (inductive reasoning)
• used to validate models (deductive reasoning)
• inductive: observation -> pattern -> hypothesis -> theory
• deductive: theory -> hypothesis -> observation -> confirmation

##### 1.2.2.2 Height/Weight Proportions (A scaling argument)

… left the workshop at this point …

### 1.3 Part 3

#### 1.3.1 Graphs for Multi-Variate Problems

More than two variables.

## 2 Lunch with Eduardo Pelegri-Llopart

Chance had it that I had lunch with Eduardo Pelegri-Llopart. Eduardo was a prime mover behind GlassFish for many years. He recently moved to RIM.

Eduardo is at OSCON to speak about "BlackBerry And Open Source, Really? Why You Should Care."

## 3 4:10pm - Hands On Mahout - Mammoth Scale Machine Learning

• Data: Analytics and Visualization
• Tags: mahout, large_data, hadoop, recommendation, classification, data_scientists, clustering, machine_learning, ml
• Robin Anil (Google), Ted Dunning (MapR Technologies)

attendance: 50

http://mahout.apache.org/

https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks#BooksTutorialsandTalks-Books

collection of machine learning algorithms

clustering

• fuzzy grouping based on "similarity"
• k-means; fuzzy k-means; mean shift; canopy; dirichlet
• similarity: distance measure: Euclidean; cosine; Tanimoto; Manhattan
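
The four distance measures can be written down from their textbook definitions for numeric feature vectors. This is a plain JavaScript sketch for illustration and does not use or reproduce Mahout's own (Java) classes; all function names are made up.

```javascript
function dotProduct(a, b) {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

// Straight-line distance.
function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// Sum of absolute coordinate differences ("city block").
function manhattanDistance(a, b) {
  return a.reduce((sum, ai, i) => sum + Math.abs(ai - b[i]), 0);
}

// 1 - cosine of the angle between the vectors; ignores vector length.
function cosineDistance(a, b) {
  const norms = Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b));
  return 1 - dotProduct(a, b) / norms;
}

// 1 - a.b / (|a|^2 + |b|^2 - a.b); for 0/1 vectors this is one minus
// the ratio of shared features to all features present in either item.
function tanimotoDistance(a, b) {
  const ab = dotProduct(a, b);
  return 1 - ab / (dotProduct(a, a) + dotProduct(b, b) - ab);
}
```

The choice matters for clustering: Euclidean and Manhattan react to magnitude, while cosine and Tanimoto compare direction or overlap, which usually suits sparse data such as word counts or item sets better.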

classification

• predict type of new item based on its features (e.g., cat versus dog)

## 4 5:00pm - CouchApps with CouchDB, JavaScript & HTML5

• Javascript & HTML5
• Tags: rest, document-oriented_database, http, json, html5, nosql, javascript, couchdb

attendance: 50

http://couchdb.apache.org/

couchdb

• database
• PROS
• schema-less JSON
• REST API
• MapReduce views to index your data
• replication
• run on server to mobile device
• CONS
• learning curve
• no TX across doc boundaries
• eventual consistency
• web server
• app server

mapping document titles

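```javascript
// Map function of a view: CouchDB calls it once per document.
function(doc) {
  // Index only documents that have a title.
  if (doc.title) {
    // The title is the key; no value is needed for a plain title index.
    emit(doc.title, null);
  }
}
```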

same-origin policy: HTML + AJAX must come from same origin, therefore couchdb must serve HTML

couchapps

• streamline codebase - no middle tier
• same language on client and server (javascript)
• generate HTML from couchdb
• replicate data/code together

ground computing

• replication filters allow you to replicate relevant data to a user
• local data -> fast access
• offline access
• data portability
• decentralization
• user has more control
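
A replication filter is a function of the document and the request that returns true for documents that should be replicated. A minimal sketch, assuming each document carries an `owner` field and the replication passes a `user` query parameter; both names are hypothetical, as is the function name `byOwner`.

```javascript
// Filter function: CouchDB calls it with each candidate document and the
// replication request. Returning true lets the document through.
// `owner` and `user` are hypothetical names chosen for this example.
function byOwner(doc, req) {
  return doc.owner === req.query.user;
}

// Local illustration with plain objects standing in for CouchDB's arguments.
const filterRequest = { query: { user: "alice" } };
const alicesDoc = { _id: "note-1", owner: "alice" };
const bobsDoc = { _id: "note-2", owner: "bob" };
const replicated = [alicesDoc, bobsDoc].filter((d) => byOwner(d, filterRequest));
```

In CouchDB the function is stored as a string under the `filters` field of a design document and referenced by name when the replication is started.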

HTML5 web storage API instead of couchdb

• persistent key/value pairs
• cross browser
• no indexed queries
• no replication

uses

• mobile apps that require offline access
• multi-device apps
• peer-to-peer
• distributed social networking
• any app that stores docs
• geospatial (GeoCouch)

tools

design documents

• each couchapp lives in a design document
• live alongside other docs
• a db can have multiple couchapps
• contains
• views
• show/list functions
• doc update handlers
• validation
• rewrite defs
• attachments
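
A sketch of what such a design document can look like, written as a JavaScript object. The field names (`views`, `shows`, `lists`, `updates`, `validate_doc_update`, `rewrites`, `_attachments`) are CouchDB's; the app name `library`, the function bodies and the attachment are made up for illustration, and `Buffer` assumes the object is built under Node.js.

```javascript
const designDoc = {
  _id: "_design/library", // hypothetical app name
  views: {
    titles: {
      map: "function(doc) { if (doc.title) { emit(doc.title, null); } }",
    },
  },
  // show: render one document as HTML
  shows: {
    page: "function(doc, req) { return '<h1>' + doc.title + '</h1>'; }",
  },
  // list: render the rows of a view
  lists: {
    index:
      "function(head, req) { var row; while ((row = getRow())) { send(row.key + '\\n'); } }",
  },
  // update handler: modify a document on the server
  updates: {
    touch:
      "function(doc, req) { doc.updated = new Date().toISOString(); return [doc, 'ok']; }",
  },
  // validation: reject writes that lack a title
  validate_doc_update:
    "function(newDoc, oldDoc, userCtx) { if (!newDoc._deleted && !newDoc.title) { throw({ forbidden: 'title required' }); } }",
  rewrites: [{ from: "/", to: "index.html" }],
  // attachments carry the static HTML/JS/CSS of the app, base64-encoded
  _attachments: {
    "index.html": {
      content_type: "text/html",
      data: Buffer.from("<h1>Library</h1>").toString("base64"),
    },
  },
};

// Functions are stored as strings, so the whole document is plain JSON.
const designDocJson = JSON.stringify(designDoc, null, 2);
```

Because code and static files live in an ordinary document, replicating the database also replicates the application.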

CouchApp Wiki: pages

## 5 Other sessions I did not attend but find interesting

### 5.1 Your Personal Data Locker

• Open Data
• Tags: telehash, data_ownership, personal_data, locker_project
• Jeremie Miller (Singly)

http://lockerproject.org/

### 5.2 Learning Nuts & Bolts of Java EE 6 in a Code Intensive Tutorial

• Java: Server
• Tags: programming, javaee6, netbeans, java, glassfish, eclipse, enterprise
• Arun Gupta (Oracle)

### 5.3 Ganeti Web Manager: Cluster Management Made Simple

• Tags: drbd, lvm, cloud, virtualization, linux, ganeti, kvm, django, operations, xen, python
• Lance Albertson (Oregon State University Open Source Lab), Peter Krenesky (Open Source Lab)

### 5.4 Cook Up a Data Mashup on the Fly with Infochimps

• Data: Roulette
• Tags: mash_up, data_api, api, data-in-the-cloud, data_in_the_cloud, mashup
• Dhruv Bansal, Winnie Hsia (Infochimps)

Tags: bof

from eBay

### 5.6 Extreme code reuse with GWT and Maven shade plugin

• Moderated by: Curtis Lee Fulton
