OSCON Wednesday
My OSCON Wednesday, 07/27/2011
Table of Contents
- 1 10:40am - Essential Data Analysis Workshop
- 2 Lunch with Eduardo Pelegri-Llopart
- 3 4:10pm - Hands On Mahout - Mammoth Scale Machine Learning
- 4 5:00pm - CouchApps with CouchDB, JavaScript & HTML5
- 5 Other sessions I did not attend but find interesting
- 5.1 Your Personal Data Locker
- 5.2 Learning Nuts & Bolts of Java EE 6 in a Code Intensive Tutorial
- 5.3 Ganeti Web Manager: Cluster Management Made Simple
- 5.4 Cook Up a Data Mashup on the Fly with Infochimps
- 5.5 Turmeric : Bring some spice to the Service Oriented Architecture
- 5.6 Extreme code reuse with GWT and Maven shade plugin
1 10:40am - Essential Data Analysis Workshop
- Data: Analytics and Visualization
- Tags: data_scientists
- Philipp Janert (Principal Value, LLC)
attendance: 32

1.1 Part 1
1.1.1 Univariate distributions
- one variable - a set of points - each measures the same quantity
- plot: quantity/Y of points at certain value/X
- critical: determine shape of point distribution
- that tells what methods are applicable
1.1.1.1 histograms and kernel density estimates
- histogram
- chop range of values/X into partitions of fixed size - then Y/# in bin
- choose bin width, not smooth, not unique (must choose bin anchoring)
- problem with histogram: shape of distribution should not depend on an arbitrary parameter
- kernel density estimates (KDE)
- modern alternative to histograms
- localized, variable width, normalized (area under each curve the same), smooth, unique, must choose bandwidth
- Gaussian function as kernel; its width is the bandwidth
- large bandwidth: smooth but little detail
- small bandwidth: noisy, but more faithful to data
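The difference between the two estimators can be sketched in a few lines. This is my own minimal illustration, not code from the workshop; the function names and the sample values are made up.

```javascript
// Histogram: count points per fixed-width bin. Both binWidth and anchor
// (where the bin grid starts) are free choices that change the picture.
function histogram(data, binWidth, anchor = 0) {
  const counts = new Map();
  for (const x of data) {
    const bin = Math.floor((x - anchor) / binWidth);
    const left = anchor + bin * binWidth; // left edge of the bin
    counts.set(left, (counts.get(left) || 0) + 1);
  }
  return counts;
}

// Gaussian kernel density estimate: one normalized bump per data point,
// summed and divided by n so that the total area is 1.
function gaussianKde(data, bandwidth) {
  const norm = 1 / (data.length * bandwidth * Math.sqrt(2 * Math.PI));
  return (x) =>
    norm *
    data.reduce((sum, xi) => {
      const u = (x - xi) / bandwidth;
      return sum + Math.exp(-0.5 * u * u);
    }, 0);
}

const sample = [1.1, 1.9, 2.0, 2.2, 3.7, 4.1, 4.3]; // made-up values
const hist = histogram(sample, 1.0);         // bins [1,2), [2,3), ...
const shifted = histogram(sample, 1.0, 0.5); // same width, other anchor
const density = gaussianKde(sample, 0.5);    // try 0.2 and 1.0 to compare
```

Moving the anchor from 0 to 0.5 regroups the same seven points into different bins, which is the non-uniqueness problem noted above. The KDE has no anchor, only the bandwidth.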
1.1.1.2 location and scale
- IQR …
1.1.1.3 outliers and outlier detection
- a point that is different from the other points
- far away from typical value
- how to measure far? - via distribution of points themselves
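One common way to measure "far" via the points themselves is the 1.5 x IQR rule. The workshop notes above only mention the IQR, so the rule and the factor 1.5 are my assumption here, and the function names are made up.

```javascript
// Linear-interpolation quantile of an already sorted array, 0 <= q <= 1.
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Flag points more than k interquartile ranges outside the quartiles.
// k = 1.5 is a convention, not something stated in the talk.
function iqrOutliers(data, k = 1.5) {
  const sorted = [...data].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return data.filter((x) => x < q1 - k * iqr || x > q3 + k * iqr);
}

const outliers = iqrOutliers([10, 12, 11, 13, 12, 11, 50]);
```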
1.1.1.4 power-law distribution
- a distribution that has no shape
- use logarithmic graph - makes info visible
- power law: 1/x^1.87
- example: earthquakes, length of rivers, stock market, …
- properties: no location nor scale; standard statistical methods do not work; occur in the real world; use double-logarithmic plots to identify
- no good understanding of how they occur or how to handle them
- more points do not mean better accuracy; average never settles
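On a double-logarithmic plot a power law becomes a straight line whose slope is the exponent. A small sketch of that idea, using synthetic data with the exponent 1.87 from above; the fitting helper is my own, and a least-squares fit on logs is only a rough way to read off the slope.

```javascript
// Least-squares slope and intercept of ys against xs.
function linearFit(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
  }
  const slope = sxy / sxx;
  return { slope, intercept: my - slope * mx };
}

// Synthetic curve y = 1/x^1.87, then a straight-line fit in log-log space.
const xs = [1, 2, 5, 10, 20, 50, 100];
const ys = xs.map((x) => Math.pow(x, -1.87));
const fit = linearFit(xs.map((x) => Math.log(x)), ys.map((y) => Math.log(y)));
// fit.slope recovers the exponent with a minus sign (about -1.87)
```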
1.1.1.5 cumulative distribution functions
- Alternative to density estimate but has different use
- cumulative shows fraction of points to left of current location: always increases; always smoother; useful for comparing
- density estimate gives an intuitive idea of the shape of the distribution; cumulative gives a quantitative idea
- properties: monotonic; unique; smoother than histogram/density; normalized to 1; use for comparison of two distributions; use to estimate tail weights
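An empirical cumulative distribution function is short enough to write out. This is my own sketch with made-up data; it counts points less than or equal to x, which is one of several possible conventions.

```javascript
// Returns F(x) = fraction of data points <= x.
function ecdf(data) {
  const sorted = [...data].sort((a, b) => a - b);
  return (x) => {
    // binary search for the number of points <= x
    let lo = 0;
    let hi = sorted.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (sorted[mid] <= x) lo = mid + 1;
      else hi = mid;
    }
    return lo / sorted.length;
  };
}

const F = ecdf([3, 1, 4, 1, 5, 9, 2, 6]);
const tailWeight = 1 - F(5); // fraction of points above 5
```

There is no bin width or bandwidth to choose, which is why the result is unique.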
1.1.1.6 rank-order plots
skipped
1.1.2 Resampling Methods
Ex: two suppliers provide some widget. How good are the suppliers? With a sample size of 100 each, one finds defect rates of 10% and 12% respectively. But that is just a point estimate, which is not good enough. If we repeated the measurement, would we always find the same defect rate? And we probably cannot repeat it (e.g., because of cost).
1.1.2.1 Bootstrap
- treat sample as representation of the system
- generate bootstrap samples by sampling with replacement from the original sample
- calculate desired quantity for each bootstrap sample
- use the scatter of values to develop confidence interval for desired quantity
- summary: synthetic samples give confidence; original sample needs to be reasonably large and clean
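The steps above applied to the supplier example, as a rough sketch. The percentile interval, the 2000 rounds and the small seeded random generator (used only to make runs repeatable) are my choices, not details from the workshop.

```javascript
// Small seeded pseudo-random generator (mulberry32 style) for repeatability.
function seededRandom(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function bootstrapInterval(sample, statistic, options = {}) {
  const { rounds = 2000, level = 0.95, random = Math.random } = options;
  const n = sample.length;
  const values = [];
  for (let r = 0; r < rounds; r++) {
    // draw n points from the original sample, with replacement
    const resample = Array.from({ length: n }, () => sample[Math.floor(random() * n)]);
    values.push(statistic(resample));
  }
  // the scatter of the statistic gives the interval
  values.sort((a, b) => a - b);
  const tail = (1 - level) / 2;
  return [values[Math.floor(tail * rounds)], values[Math.ceil((1 - tail) * rounds) - 1]];
}

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
// 100 widgets from one supplier, 10 of them defective (1 = defect)
const supplierA = Array.from({ length: 100 }, (_, i) => (i < 10 ? 1 : 0));
const [low, high] = bootstrapInterval(supplierA, mean, { random: seededRandom(42) });
```

Running the same thing for the second supplier and comparing the two intervals shows whether 10% versus 12% is a real difference at this sample size.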
1.1.2.2 variants
- variant: Jackknife: "leave-one-out" resampling
- variant: parametric bootstrap: for a kernel density estimate, generate bootstrap sample from that (rather than from the original sample directly)
- Summary of resampling methods: turn point estimates into interval estimates; work if sample is decent; do not rely on assumptions or analytic arguments; can be applied to non-parametric problems
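A leave-one-out sketch of the jackknife variant, with the usual jackknife formula for the standard error. The formula and the sample data are my addition, not from the workshop notes.

```javascript
// Recompute the statistic n times, each time leaving out one point.
function jackknife(sample, statistic) {
  const n = sample.length;
  const estimates = sample.map((_, i) => statistic(sample.filter((_, j) => j !== i)));
  const center = estimates.reduce((a, b) => a + b, 0) / n;
  // jackknife variance: (n - 1)/n times the sum of squared deviations
  const variance = ((n - 1) / n) * estimates.reduce((s, e) => s + (e - center) ** 2, 0);
  return { estimates, standardError: Math.sqrt(variance) };
}

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
const result = jackknife([2, 4, 4, 4, 5, 5, 7, 9], mean);
```

For the mean, this reproduces the textbook standard error s/sqrt(n), which makes it a convenient sanity check before using it on a statistic that has no simple formula.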
1.1.2.3 Examples
…
1.2 Part 2
1.2.1 Time Series and Time-Series Analysis
- Ex: airline passengers per month: seasonality, trending up
- Ex: Nikkei Index: trend up until 1990 then very different
- Ex: calls per day to call center: no trend, but areas and outliers; weekly periodicity
1.2.1.1 Concepts
- components: trend; seasonality; noise; other (e.g., outliers)
- tasks: description (past); forecasting (future); control (present)
- requirements: equal-size increment; no missing points; enough data points
1.2.1.2 Moving ("floating") Averages
- task: smoothing to get rid of noise
- solution: given a smoothing interval, replace the center point with an unweighted average of adjacent points; the larger the smoothing interval, the smoother the result
- can lead to unrealistic artifacts (e.g., outliers cause jumps in average)
- weighted: give greater weight to points near the center of the interval
- problems with moving averages: interior points only; no forecasting; computationally awkward (entire series must be available)
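Both the unweighted and the weighted form fit in one function. A sketch of my own with a made-up series; the weights array must have odd length.

```javascript
// Centered moving average. The weights are normalized to sum to 1.
// Points without a full window on both sides get null (interior points only).
function movingAverage(series, weights) {
  const half = (weights.length - 1) / 2;
  const total = weights.reduce((a, b) => a + b, 0);
  return series.map((_, i) => {
    if (i < half || i >= series.length - half) return null;
    let sum = 0;
    for (let k = -half; k <= half; k++) sum += weights[k + half] * series[i + k];
    return sum / total;
  });
}

const series = [1, 2, 3, 10, 3, 2, 1];
const flat = movingAverage(series, [1, 1, 1]);     // unweighted
const weighted = movingAverage(series, [1, 2, 1]); // center counts double
```

The nulls at both ends are the "interior points only" problem, and the spike of 10 lifting its neighbours in the unweighted result is the kind of artifact mentioned above.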
1.2.1.3 Exponential Smoothing (Holt-Winters Methods)
- previous smoothed value is updated based on new data
- smoothed value is mixture of new "raw" data and previous "smooth" value
- three forms
- single : for data without trend
- double : for data with trend
- triple : data with trend and periodicity
- summary: simple; versatile; few assumptions; updating scheme easy to implement; use mixing parameter to balance smoothness versus faithfulness; single/double/triple depending on data set
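The updating scheme for the single and double forms, sketched from the standard textbook recurrences. The parameter names alpha and beta and the way the first values are initialized are my assumptions; the triple form adds a seasonal term and is left out.

```javascript
// Single: new smoothed value = alpha * raw + (1 - alpha) * previous smoothed.
function singleExponential(series, alpha) {
  const out = [series[0]];
  for (let t = 1; t < series.length; t++) {
    out.push(alpha * series[t] + (1 - alpha) * out[t - 1]);
  }
  return out;
}

// Double: track a level and a trend, each with its own mixing parameter.
function doubleExponential(series, alpha, beta) {
  let level = series[0];
  let trend = series[1] - series[0]; // simple initial guess for the trend
  const smoothed = [level];
  for (let t = 1; t < series.length; t++) {
    const previous = level;
    level = alpha * series[t] + (1 - alpha) * (previous + trend);
    trend = beta * (level - previous) + (1 - beta) * trend;
    smoothed.push(level);
  }
  // forecast h steps past the end of the series
  return { smoothed, forecast: (h) => level + h * trend };
}

const single = singleExponential([10, 20, 30], 0.5);
const double = doubleExponential([1, 3, 5, 7, 9], 0.5, 0.5);
```

Unlike a moving average, each step needs only the previous smoothed value, and the double form can extrapolate past the last data point.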
1.2.1.4 Outlier Detection in Time Series
- method
- 1: smooth the data to establish a baseline ("location")
- 2: find residual between baseline and data (subtract smooth value from data)
- 3: determine distribution (via kernel density estimate) of residual to find the "width"
- 4: use the cumulative distribution to establish "confidence bands"
- 5: outliers differ from the baseline by more than the "width"
- alternative: use Holt-Winters smoothing (smooth the residual)
- advantage over outlier detection: works "up to the edge" (not just for interior points); adapts if the width of the residual changes
- special cases:
- if baseline is known and fixed then simpler (e.g., control problems like heated room with fixed temperature setting)
- if data is highly seasonal use "last year's value" as baseline
- summary
- visual inspection
- smoothing to establish baseline
- calculate residual
- distribution of residuals: "width"
- cumulative distribution of residuals: "confidence bands"
- smooth baseline to find outliers
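A compressed sketch of the five steps. To keep it short I swap in two simplifications: the baseline is an unweighted centered moving average, and the "width" is read directly off the empirical cumulative distribution of the absolute residuals rather than going through a kernel density estimate. The window size and the 0.99 level are arbitrary choices of mine.

```javascript
// Linear-interpolation quantile of an already sorted array.
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function timeSeriesOutliers(series, half = 2, level = 0.99) {
  // steps 1 and 2: baseline from a moving average, residual = data - baseline
  const residuals = [];
  for (let i = half; i < series.length - half; i++) {
    let sum = 0;
    for (let k = i - half; k <= i + half; k++) sum += series[k];
    residuals.push({ index: i, value: series[i] - sum / (2 * half + 1) });
  }
  // steps 3 and 4: width from the cumulative distribution of |residual|
  const sortedAbs = residuals.map((r) => Math.abs(r.value)).sort((a, b) => a - b);
  const width = quantile(sortedAbs, level);
  // step 5: outliers differ from the baseline by more than the width
  return residuals.filter((r) => Math.abs(r.value) > width).map((r) => r.index);
}

// made-up series: gentle wiggle around 10 with one spike at index 15
const series = Array.from({ length: 30 }, (_, i) => 10 + 0.5 * Math.sin(i) + (i === 15 ? 12 : 0));
const flagged = timeSeriesOutliers(series);
```

Because the band is a quantile of the residuals themselves, a fixed fraction of points always falls outside it, so the level has to match the outlier rate one expects. The moving-average baseline also means the first and last few points are never checked.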
1.2.1.5 Trend Detection in Time Series
- Ex: fluctuations around a known setpoint
- use: cusum chart: makes small change in setpoint visible
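The quantity plotted in a cusum chart is just the running sum of deviations from the setpoint. A minimal sketch with made-up readings:

```javascript
// Cumulative sum of (value - setpoint). Noise around the setpoint cancels
// out, while a small persistent shift shows up as a steady slope.
function cusum(series, setpoint) {
  let sum = 0;
  return series.map((x) => (sum += x - setpoint));
}

// setpoint 10; from the fourth reading on the level sits at 10.5
const chart = cusum([10, 11, 9, 10.5, 10.5, 10.5], 10);
```

A shift of 0.5 is hard to see in the raw readings next to swings of plus or minus 1, but in the cumulative sum it keeps climbing.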
1.2.2 Modeling
- the purpose of data analysis is to understand the system, not the data
- data is means to an end
- helps to formulate ideas (inductive reasoning)
- used to validate models (deductive reasoning)
- inductive: observation -> pattern -> hypothesis -> theory
- deductive: theory -> hypothesis -> observation -> confirmation
1.2.2.1 Phone Calls (A Probabilistic model)
1.2.2.2 Height/Weight Proportions (A scaling argument)
… left the workshop at this point …
1.3 Part 3
1.3.1 Graphs for Multi-Variate Problems
More than two variables.
1.3.1.1 False-Color Plots
1.3.1.2 Co-Plots
1.3.1.3 Mosaic Plots
1.3.1.4 Parallel-Coordinate Plots
1.3.2 Feature Selection
1.3.3 Wrap-Up
2 Lunch with Eduardo Pelegri-Llopart
Chance had it that I had lunch with Eduardo Pelegri-Llopart. Eduardo was a prime mover behind GlassFish for many years. He recently moved to RIM.
Eduardo is at OSCON to speak about "BlackBerry And Open Source, Really? Why You Should Care."
3 4:10pm - Hands On Mahout - Mammoth Scale Machine Learning
- Data: Analytics and Visualization
- Tags: mahout, large_data, hadoop, recommendation, classification, data_scientists, clustering, machine_learning, ml
- Robin Anil (Google), Ted Dunning (MapR Technologies)
attendance: 50

collection of machine learning algorithms
clustering
- fuzzy grouping based on "similarity"
- k-means; fuzzy k-means; mean shift; canopy; dirichlet
- similarity: distance measure: euclidean; cosine; Tanimoto; Manhattan
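For reference, the four distance measures written out for plain numeric vectors. These are the usual textbook definitions in JavaScript, not Mahout's Java classes.

```javascript
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

// straight-line distance
const euclidean = (a, b) => Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));

// sum of absolute coordinate differences
const manhattan = (a, b) => a.reduce((s, x, i) => s + Math.abs(x - b[i]), 0);

// 1 - cosine of the angle between the vectors (ignores vector length)
const cosineDistance = (a, b) => 1 - dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));

// 1 - Tanimoto coefficient (a.b / (|a|^2 + |b|^2 - a.b))
const tanimotoDistance = (a, b) => 1 - dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b));
```

Cosine distance treats [1, 2] and [2, 4] as identical because they point in the same direction, while the other three do not. Which behaviour is wanted depends on the data being clustered.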
classification
- predict type of new item based on its features (e.g., cat versus dog)
…
4 5:00pm - CouchApps with CouchDB, JavaScript & HTML5
- Javascript & HTML5
- Tags: rest, document-oriented_database, http, json, html5, nosql, javascript, couchdb
- Bradley Holt (Found Line)

attendance: 50

couchdb
- database
- PROS
- schema-less JSON
- REST API
- MapReduce views to index your data
- replication
- run on server to mobile device
- CONS
- no ad-hoc queries
- learning curve
- no TX across doc boundaries
- eventual consistency
- PROS
- web server
- app server
mapping document titles
function(doc) {
  // index only documents that have a title; the title becomes the view key
  if (doc.title) {
    emit(doc.title, null);
  }
}
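To see what such a map function produces, here is a small local emulation of building a view. It is not how CouchDB runs things: in CouchDB emit is a global inside the view server, whereas this sketch passes it in as a second argument, and the sort is a plain string comparison rather than CouchDB's collation. The documents are made up.

```javascript
// Run the map function over every document and collect the emitted rows,
// sorted by key, roughly like the rows a view query returns.
function buildView(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    const emit = (key, value = null) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// same logic as the map function above, with emit passed in
const byTitle = (doc, emit) => {
  if (doc.title) {
    emit(doc.title);
  }
};

const rows = buildView(
  [{ _id: "a", title: "Zebra" }, { _id: "b" }, { _id: "c", title: "Apple" }],
  byTitle
);
```

The document without a title simply does not appear in the index, and the rows come back ordered by title.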
same-origin policy: HTML + AJAX must come from same origin, therefore couchdb must serve HTML
change notification via long polling
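A sketch of what the long-polling loop can look like from the client side, using CouchDB's _changes feed with feed=longpoll and since. The base URL and database name are placeholders, and the loop has no error handling or stop condition.

```javascript
// Build the URL for one long-poll request, starting after sequence `since`.
function changesUrl(baseUrl, db, since) {
  const params = new URLSearchParams({ feed: "longpoll", since: String(since) });
  return `${baseUrl}/${encodeURIComponent(db)}/_changes?${params}`;
}

// Each request blocks on the server until something changes, then the
// client immediately asks again from the last sequence it has seen.
async function followChanges(baseUrl, db, onChange, fetchFn = fetch) {
  let since = 0;
  for (;;) {
    const response = await fetchFn(changesUrl(baseUrl, db, since));
    const body = await response.json();
    body.results.forEach(onChange);
    since = body.last_seq;
  }
}

// placeholder server and database name
const firstRequest = changesUrl("http://localhost:5984", "notes", 0);
```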
couchapps
- streamline codebase - no middle tier
- same language on client and server (javascript)
- generate HTML from couchdb
- replicate data/code together
ground computing
- replication filters allow you to replicate relevant data to a user
- local data -> fast access
- offline access
- data portability
- decentralization
- user has more control
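A replication filter is a function stored in a design document that decides, per document, whether it is replicated. A sketch under my own assumptions: the owner field, the design document name "app" and the URLs are invented for illustration.

```javascript
// Filter function: return true to replicate the document.
// req.query carries the query_params given when replication is started.
function byOwner(doc, req) {
  return doc.owner === req.query.owner;
}

// Body for a replication request that pulls only alice's documents.
// Names and URLs are placeholders.
const replication = {
  source: "http://example.com/notes",
  target: "notes",
  filter: "app/by_owner",
  query_params: { owner: "alice" },
};

// Trying the filter locally against two made-up documents:
const request = { query: replication.query_params };
const replicated = [
  { _id: "1", owner: "alice" },
  { _id: "2", owner: "bob" },
].filter((doc) => byOwner(doc, request));
```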
HTML5 web storage API instead of couchdb
- persistent key/value pairs
- cross browser
- no indexed queries
- no replication
uses
- mobile apps that require offline access
- multi-device apps
- peer-to-peer
- distributed social networking
- any app that stores docs
- geospatial (GeoCouch)
tools
- mustache.js - logic-less templates
- evently - events
design documents
- each couchapp lives in a design document
- lives alongside other docs
- a db can have multiple couchapps
- contains
- views
- show/list functions
- doc update handlers
- validation
- rewrite defs
- attachments
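Roughly what a design document with a view and a validation function looks like. A sketch only: the names "notes" and by_title and the title rule are made up, and the functions are stored as strings because a design document is itself plain JSON.

```javascript
// View map function; emit is provided by CouchDB when the view runs.
function mapByTitle(doc) {
  if (doc.title) {
    emit(doc.title, null);
  }
}

// Validation: runs on every write; throwing { forbidden } rejects the doc.
function validate(newDoc, oldDoc, userCtx) {
  if (!newDoc._deleted && !newDoc.title) {
    throw { forbidden: "every document needs a title" };
  }
}

const designDoc = {
  _id: "_design/notes", // hypothetical couchapp name
  views: {
    by_title: { map: mapByTitle.toString() },
  },
  validate_doc_update: validate.toString(),
};

const json = JSON.stringify(designDoc); // what gets saved to the database
```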
CouchApp Wiki: pages
…
5 Other sessions I did not attend but find interesting
5.1 Your Personal Data Locker
- Open Data
- Tags: telehash, data_ownership, personal_data, locker_project
- Jeremie Miller (Singly)

5.2 Learning Nuts & Bolts of Java EE 6 in a Code Intensive Tutorial
- Java: Server
- Tags: programming, javaee6, netbeans, java, glassfish, eclipse, enterprise
- Arun Gupta (Oracle)

5.3 Ganeti Web Manager: Cluster Management Made Simple
- Operations & System Administration
- Tags: drbd, lvm, cloud, virtualization, linux, ganeti, kvm, django, operations, xen, python
- Lance Albertson (Oregon State University Open Source Lab), Peter Krenesky (Open Source Lab)
5.4 Cook Up a Data Mashup on the Fly with Infochimps
- Data: Roulette
- Tags: mash_up, data_api, api, data-in-the-cloud, data_in_the_cloud, mashup
- Dhruv Bansal, Winnie Hsia (Infochimps)

5.5 Turmeric : Bring some spice to the Service Oriented Architecture
Tags: bof

from eBay
5.6 Extreme code reuse with GWT and Maven shade plugin
- Moderated by: Curtis Lee Fulton

- haroldcarr's blog