
OSCON Wednesday

Posted by haroldcarr on July 27, 2011 at 8:00 PM PDT

My OSCON Wednesday, 07/27/2011


1 10:40am - Essential Data Analysis Workshop

attendance: 32

http://www.principal-value.com/images/da-cat.gif

1.1 Part 1

 

1.1.1 Univariate distributions

  • one variable - a set of points - each measures the same quantity
  • plot: the number of points (Y) at a given value (X)
  • critical: determine shape of point distribution
    • that tells what methods are applicable
1.1.1.1 histograms and kernel density estimates
  • histogram
    • chop the range of values (X) into bins of fixed width; Y is the number of points in each bin
    • must choose bin width; not smooth; not unique (must also choose bin anchoring)
    • problem with histograms: the shape of the distribution should not depend on an arbitrary parameter
  • kernel density estimates (KDE)
    • modern alternative to histograms
    • localized, variable width, normalized (area under each curve the same), smooth, unique, must choose bandwidth
    • Gaussian kernel; the bandwidth controls the amount of smoothing (see the sketch after this list)
      • large bandwidth: smooth but little detail
      • small bandwidth: noisy, but more faithful to data
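
A minimal sketch of a Gaussian kernel density estimate, assuming plain JavaScript and a made-up sample, bandwidth, and evaluation point (not code from the workshop):

// Gaussian kernel: standard normal density
function gaussian(u) {
  return Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
}

// KDE at point x: each data point contributes a Gaussian "bump";
// the bandwidth h controls how wide (smooth) each bump is.
function kde(data, h, x) {
  var sum = 0;
  for (var i = 0; i < data.length; i++) {
    sum += gaussian((x - data[i]) / h);
  }
  return sum / (data.length * h);  // normalized: total area under the curve is 1
}

var sample = [1.1, 1.3, 2.0, 2.2, 2.3, 3.7, 4.1];
console.log(kde(sample, 0.5, 2.0));  // large bandwidth: smooth, little detail
console.log(kde(sample, 0.1, 2.0));  // small bandwidth: noisier, more faithful to the points
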
1.1.1.2 location and scale
  • IQR …
1.1.1.3 outliers and outlier detection
  • a point that is different from the other points
  • far away from typical value
  • how to measure "far"? via the distribution of the points themselves (see the sketch below)
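
As one rough way to make "far" concrete via the distribution of the points themselves, here is the conventional 1.5 × IQR rule (the multiplier and the data are illustrative, not from the talk):

// Flag points that lie more than 1.5 * IQR outside the quartiles.
function quantile(sorted, q) {
  var pos = (sorted.length - 1) * q;
  var lo = Math.floor(pos), hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function iqrOutliers(data) {
  var sorted = data.slice().sort(function (a, b) { return a - b; });
  var q1 = quantile(sorted, 0.25), q3 = quantile(sorted, 0.75);
  var iqr = q3 - q1;
  return data.filter(function (x) {
    return x < q1 - 1.5 * iqr || x > q3 + 1.5 * iqr;
  });
}

console.log(iqrOutliers([2, 3, 3, 4, 4, 5, 5, 6, 42]));  // -> [ 42 ]
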
1.1.1.4 power-law distribution
  • a distribution that has no shape
  • use logarithmic graph - makes info visible
  • power law: p(x) ~ 1/x^1.87 (for example)
  • examples: earthquakes, length of rivers, stock market, …
  • properties: no location or scale; standard statistical methods do not work; occur in the real world; use double-logarithmic plots to identify them
  • we do not have a good understanding of how they occur or how to handle them
  • more points do not mean better accuracy; the average never settles (see the sketch below)
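
A quick sketch of "the average never settles": draw from a heavy-tailed Pareto-type distribution and watch the running mean keep drifting (the exponent, sample count, and sampler are my own illustration):

// Draw from a Pareto distribution with xmin = 1 and tail exponent alpha
// using inverse-transform sampling.
function paretoSample(alpha) {
  return Math.pow(1 - Math.random(), -1 / alpha);
}

// Print the running mean every so often; for a heavy tail it stabilizes
// only very slowly, if at all, because rare huge values keep arriving.
var alpha = 1.2, sum = 0;
for (var n = 1; n <= 1000000; n++) {
  sum += paretoSample(alpha);
  if (n % 100000 === 0) {
    console.log(n, (sum / n).toFixed(2));
  }
}
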
1.1.1.5 cumulative distribution functions
  • Alternative to density estimate but has different use
  • the cumulative shows the fraction of points to the left of the current location: always increases; always smoother; useful for comparing
  • a density estimate gives an intuitive idea of the shape of the function; the cumulative gives a quantitative idea
  • properties: monotonic; unique; smoother than a histogram/density; normalized to 1; use to compare two distributions; use to estimate tail weights (see the sketch below)
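
A minimal sketch of an empirical cumulative distribution function over a made-up sample: for each x it reports the fraction of points at or below x.

// Return a function that gives the fraction of data points <= x.
function ecdf(data) {
  var sorted = data.slice().sort(function (a, b) { return a - b; });
  return function (x) {
    var count = 0;
    while (count < sorted.length && sorted[count] <= x) { count++; }
    return count / sorted.length;  // monotonic, normalized to 1
  };
}

var cdf = ecdf([1.1, 1.3, 2.0, 2.2, 2.3, 3.7, 4.1]);
console.log(cdf(2.0));  // fraction of points <= 2.0 (here 3/7)
console.log(cdf(5.0));  // -> 1
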
1.1.1.6 rank-order plots

skipped

1.1.2 Resampling Methods

Ex: two suppliers provide some widget. How good are the suppliers? With a sample size of 100, one finds defect rates of 10% and 12% respectively. But that is just a point estimate - not good enough. If we repeated the sampling, would we always find the same defect rate? We probably can't repeat it (too costly).

1.1.2.1 Bootstrap
  • treat sample as representation of the system
  • generate bootstrap samples by sampling with replacement from the original sample
  • calculate desired quantity for each bootstrap sample
  • use the scatter of values to develop confidence interval for desired quantity
  • summary: synthetic samples give confidence; the original sample needs to be reasonably large and clean (see the sketch below)
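
A minimal sketch of the bootstrap applied to the supplier example above (the 10% defect rate, 100 widgets, and 1,000 replicates are illustrative):

// Original sample: 100 widgets, 10 defective (1 = defect, 0 = ok).
var sample = [];
for (var i = 0; i < 100; i++) { sample.push(i < 10 ? 1 : 0); }

function defectRate(s) {
  return s.reduce(function (a, b) { return a + b; }, 0) / s.length;
}

// One bootstrap sample: same size, drawn with replacement from the original.
function bootstrapSample(s) {
  var out = [];
  for (var j = 0; j < s.length; j++) {
    out.push(s[Math.floor(Math.random() * s.length)]);
  }
  return out;
}

// Compute the defect rate for many bootstrap samples and take the 2.5% and
// 97.5% points of the scatter as a rough confidence interval.
var rates = [];
for (var r = 0; r < 1000; r++) { rates.push(defectRate(bootstrapSample(sample))); }
rates.sort(function (a, b) { return a - b; });
console.log('point estimate:', defectRate(sample));
console.log('approx 95% interval:', rates[25], '-', rates[974]);
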
1.1.2.2 variants
  • variant: Jackknife: "leave-one-out" resampling (see the sketch after this list)
  • variant: parametric bootstrap: fit a kernel density estimate and generate bootstrap samples from that (rather than from the original sample directly)
  • summary of resampling methods: turn point estimates into interval estimates; work if the sample is decent; do not rely on assumptions or analytic arguments; can be applied to non-parametric problems
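
And a minimal sketch of the jackknife, using the mean of a made-up sample as the statistic: recompute the estimate with each point left out in turn and look at the spread.

// Leave-one-out estimates of the mean; their scatter indicates how
// sensitive the estimate is to individual points.
function jackknifeMeans(data) {
  var estimates = [];
  for (var i = 0; i < data.length; i++) {
    var rest = data.slice(0, i).concat(data.slice(i + 1));
    var mean = rest.reduce(function (a, b) { return a + b; }, 0) / rest.length;
    estimates.push(mean);
  }
  return estimates;
}

console.log(jackknifeMeans([1.1, 1.3, 2.0, 2.2, 2.3, 3.7, 4.1]));
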
1.1.2.3 Examples

1.2 Part 2

 

1.2.1 Time Series and Time-Series Analysis

  • Ex: airline passengers per month: seasonality, trending up
  • Ex: Nikkei Index: trend up until 1990 then very different
  • Ex: calls per day to a call center: no trend, but distinct areas and outliers; weekly periodicity
1.2.1.1 Concepts
  • components: trend; seasonality; noise; other (e.g., outliers)
  • tasks: description (past); forecasting (future); control (present)
  • requirements: equal-size increments; no missing points; enough data points
1.2.1.2 Moving ("floating") Averages
  • task: smoothing to get rid of noise
    • solution: given a smoothing interval, replace the center point with an unweighted average of adjacent points; the larger the smoothing interval, the smoother the result (see the sketch after this list)
    • can lead to unrealistic artifacts (e.g., outliers cause jumps in average)
    • weighted: give greater weight to points near the center of the interval
  • problems with moving averages: interior points only; no forecasting; computationally awkward (entire series must be available)
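
A minimal sketch of an unweighted, centered moving average (the series and the window size are made up); note that it only produces values for interior points, one of the drawbacks listed above:

// Centered, unweighted moving average with odd window size k.
function movingAverage(series, k) {
  var half = Math.floor(k / 2), out = [];
  for (var i = half; i < series.length - half; i++) {
    var sum = 0;
    for (var j = i - half; j <= i + half; j++) { sum += series[j]; }
    out.push(sum / k);
  }
  return out;  // shorter than the input: no values at the edges
}

console.log(movingAverage([5, 7, 6, 9, 20, 8, 7, 6, 8], 3));
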
1.2.1.3 Exponential Smoothing (Holt-Winters Methods)
  • previous smoothed value is updated based on new data
  • smoothed value is mixture of new "raw" data and previous "smooth" value
  • three forms
    • single : for data without trend
    • double : for data with trend
    • triple : data with trend and periodicity
  • summary: simple; versatile; few assumptions; the updating scheme is easy to implement; use the mixing parameter to balance smoothness versus faithfulness; choose single/double/triple depending on the data set (see the sketch below)
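
A minimal sketch of single exponential smoothing (the mixing parameter and series are illustrative); double and triple smoothing add analogous update equations for trend and seasonality:

// Single exponential smoothing: each smoothed value is a mixture of the
// new raw observation and the previous smoothed value.
// alpha close to 1 -> faithful to the data; alpha close to 0 -> very smooth.
function singleExponentialSmoothing(series, alpha) {
  var smoothed = [series[0]];
  for (var i = 1; i < series.length; i++) {
    smoothed.push(alpha * series[i] + (1 - alpha) * smoothed[i - 1]);
  }
  return smoothed;
}

console.log(singleExponentialSmoothing([5, 7, 6, 9, 20, 8, 7, 6, 8], 0.3));
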
1.2.1.4 Outlier Detection in Time Series
  • method
    • 1: smooth the data to establish a baseline ("location")
    • 2: find residual between baseline and data (subtract smooth value from data)
    • 3: determine distribution (via kernel density estimate) of residual to find the "width"
    • 4: use the cumulative distribution to establish "confidence bands"
    • 5: outliers differ from the baseline by more than the "width"
  • alternative: use Holt-Winters smoothing (smooth the residual)
    • advantages: works "up to the edge" (not just for interior points); adapts if the width of the residual changes
  • special cases:
    • if baseline is known and fixed then simpler (e.g., control problems like heated room with fixed temperature setting)
    • if data is highly seasonal use "last year's value" as baseline
  • summary
    • visual inspection
    • smoothing to establish baseline
    • calculate residual
    • distribution of residuals: "width"
    • cumulative distribution of residuals: "confidence bands"
    • compare points to the smoothed baseline: those outside the bands are outliers (see the sketch below)
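
Putting the steps above together, a rough sketch: smooth to get a baseline, take residuals, estimate a "width" from the residuals, and flag points that differ from the baseline by more than a few widths. The smoothing parameter, the median-absolute-residual width, and the threshold are stand-ins for the kernel-density / confidence-band machinery described in the talk:

// 1: smooth the data to establish a baseline (single exponential smoothing here).
function smooth(series, alpha) {
  var out = [series[0]];
  for (var i = 1; i < series.length; i++) {
    out.push(alpha * series[i] + (1 - alpha) * out[i - 1]);
  }
  return out;
}

// 2-5: residuals, a "width" for the residuals, and flagging points whose
// residual exceeds k times that width.
function timeSeriesOutliers(series, alpha, k) {
  var baseline = smooth(series, alpha);
  var residuals = series.map(function (x, i) { return x - baseline[i]; });
  var abs = residuals.map(function (r) { return Math.abs(r); })
                     .sort(function (a, b) { return a - b; });
  var width = abs[Math.floor(abs.length / 2)];  // median absolute residual
  return series.map(function (x, i) { return { index: i, value: x, residual: residuals[i] }; })
               .filter(function (p) { return Math.abs(p.residual) > k * width; });
}

// Made-up series with one obvious spike; the point just after the spike may
// also be flagged because this simple baseline is disturbed by the spike.
console.log(timeSeriesOutliers([5, 7, 6, 9, 6, 8, 40, 6, 8, 7, 6, 9, 7, 6, 8, 7], 0.3, 5));
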
1.2.1.5 Trend Detection in Time Series
  • Ex: fluctuations around a known setpoint
  • use a cusum (cumulative sum) chart: makes a small change in the setpoint visible (see the sketch below)
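
A minimal sketch of a cusum chart: accumulate the deviations from the known setpoint; noise averages out, but a small sustained shift makes the cumulative sum drift visibly (the setpoint and data are made up):

// Cumulative sum of deviations from a known setpoint.
function cusum(series, setpoint) {
  var out = [], sum = 0;
  for (var i = 0; i < series.length; i++) {
    sum += series[i] - setpoint;
    out.push(sum);
  }
  return out;
}

// First half fluctuates around 20.0, second half around 20.2.
var data = [20.1, 19.9, 20.0, 19.8, 20.2, 20.3, 20.1, 20.3, 20.2, 20.4];
console.log(cusum(data, 20.0));  // drifts upward once the shift begins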

1.2.2 Modeling

  • the purpose of data analysis is to understand the system, not the data
  • data is means to an end
    • helps to formulate ideas (inductive reasoning)
    • used to validate models (deductive reasoning)
  • inductive: observation -> pattern -> hypothesis -> theory
  • deductive: theory -> hypothesis -> observation -> confirmation
1.2.2.1 Phone Calls (A Probabilistic model)
 
1.2.2.2 Height/Weight Proportions (A scaling argument)

… left the workshop at this point …

1.3 Part 3

 

1.3.1 Graphs for Multi-Variate Problems

More than two variables.

1.3.1.1 False-Color Plots
 
1.3.1.2 Co-Plots
 
1.3.1.3 Mosaic Plots
 
1.3.1.4 Parallel-Coordinate Plots
 

1.3.2 Feature Selection

 

1.3.3 Wrap-Up


2 Lunch with Eduardo Pelegri-Llopart

As chance would have it, I had lunch with Eduardo Pelegri-Llopart. Eduardo was a prime mover behind GlassFish for many years. He recently moved to RIM.

http://pelegri.files.wordpress.com/2011/05/glassfishlogo-98_74.png

Eduardo is at OSCON to speak about BlackBerry And Open Source, Really? Why You Should Care.


3 4:10pm - Hands On Mahout - Mammoth Scale Machine Learning

  • Data: Analytics and Visualization
  • Tags: mahout, large_data, hadoop, recommendation, classification, data_scientists, clustering, machine_learning, ml
  • Robin Anil (Google), Ted Dunning (MapR Technologies)

attendance: 50

http://mahout.apache.org/

http://mahout.apache.org/images/mantle-mahout.png

https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks#BooksTutorialsandTalks-Books

collection of machine learning algorithms

clustering

  • fuzzy grouping based on "similarity"
  • k-means; fuzzy k-means; mean shift; canopy; Dirichlet
  • similarity via a distance measure: Euclidean; cosine; Tanimoto; Manhattan (see the sketch below)
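
To illustrate two of these distance measures on plain numeric feature vectors (this is just the math, not Mahout's Java API; the vectors are made up):

// Euclidean distance between two feature vectors.
function euclidean(a, b) {
  var sum = 0;
  for (var i = 0; i < a.length; i++) { sum += (a[i] - b[i]) * (a[i] - b[i]); }
  return Math.sqrt(sum);
}

// Cosine distance: 1 - cosine of the angle between the vectors
// (insensitive to overall magnitude, only to direction).
function cosineDistance(a, b) {
  var dot = 0, na = 0, nb = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

console.log(euclidean([1, 2, 3], [2, 2, 5]));       // ~2.24
console.log(cosineDistance([1, 2, 3], [2, 4, 6]));  // 0 (same direction)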

classification

  • predict type of new item based on its features (e.g., cat versus dog)


4 5:00pm - CouchApps with CouchDB, JavaScript & HTML5

  • Javascript & HTML5
  • Tags: rest, document-oriented_database, http, json, html5, nosql, javascript, couchdb
  • Bradley Holt (Found Line)

http://covers.oreilly.com/images/0636920018407/cat.gif http://covers.oreilly.com/images/0636920018247/cat.gif

attendance: 50

http://couchdb.apache.org/

http://a.imagehost.org/0364/couchdb-wiki-logo-main.png

couchdb

  • database
    • PROS
      • schema-less JSON
      • REST API
      • MapReduce views to index your data
      • replication
      • run on server to mobile device
    • CONS
      • no ad-hoc queries
      • learning curve
      • no TX across doc boundaries
      • eventual consistency
  • web server
  • app server

mapping document titles

// CouchDB map function: emit each document's title as a view key
// (documents without a title are skipped).
function(doc) {
  if (doc.title) {
    emit(doc.title);
  }
}

same-origin policy: HTML + AJAX must come from the same origin, therefore CouchDB must serve the HTML

change notification via long polling

couchapps

  • streamline codebase - no middle tier
  • same language on client and server (JavaScript)
  • generate HTML from couchdb
  • replicate data/code together

ground computing

  • replication filters allow you to replicate relevant data to a user
  • local data -> fast access
  • offline access
  • data portability
  • decentralization
  • user has more control

HTML5 Web Storage API as an alternative to CouchDB (see the sketch after this list)

  • persistent key/value pairs
  • cross browser
  • no indexed queries
  • no replication
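
A minimal sketch of the Web Storage API mentioned above, run in a browser (the key and value are made up); it gives persistent key/value pairs but only direct key lookups:

// Values are strings, so structured data is typically JSON-encoded.
localStorage.setItem('draft:42', JSON.stringify({ title: 'OSCON notes', saved: Date.now() }));

var draft = JSON.parse(localStorage.getItem('draft:42'));
console.log(draft.title);             // -> "OSCON notes"

localStorage.removeItem('draft:42');  // no indexed queries, no replication --
                                      // only direct key lookups like these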

uses

  • mobile apps that require offline access
  • multi-device apps
  • peer-to-peer
  • distributed social networking
  • any app that stores docs
  • geospatial (GeoCouch)

tools

design documents

  • each couchapp lives in a design document
  • design documents live alongside other docs
  • a db can have multiple couchapps
  • a design document contains (see the sketch after this list)
    • views
    • show/list functions
    • doc update handlers
    • validation
    • rewrite defs
    • attachments
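
A hedged sketch of what such a design document might look like, embedding the earlier map function as a view (the names "app" and "titles" are mine; show/list functions, update handlers, and attachments would sit alongside):

{
  "_id": "_design/app",
  "views": {
    "titles": {
      "map": "function(doc) { if (doc.title) { emit(doc.title); } }"
    }
  },
  "validate_doc_update": "function(newDoc, oldDoc, userCtx) { /* reject bad writes here */ }",
  "rewrites": [
    { "from": "/", "to": "index.html" }
  ]
}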

CouchApp Wiki: pages


5 Other sessions I did not attend but find interesting

 

5.1 Your Personal Data Locker

  • Open Data
  • Tags: telehash, data_ownership, personal_data, locker_project
  • Jeremie Miller (Singly)

http://lockerproject.org/

http://singly.com/images/singly.png http://lockerproject.org/images/tlp_web_logo.png

5.2 Learning Nuts & Bolts of Java EE 6 in a Code Intensive Tutorial

  • Java: Server
  • Tags: programming, javaee6, netbeans, java, glassfish, eclipse, enterprise
  • Arun Gupta (Oracle)

http://blogs.oracle.com/arungupta/resource/personal/arun-200x200.png

5.3 Ganeti Web Manager: Cluster Management Made Simple

  • Operations & System Administration
  • Tags: drbd, lvm, cloud, virtualization, linux, ganeti, kvm, django, operations, xen, python
  • Lance Albertson (Oregon State University Open Source Lab), Peter Krenesky (Open Source Lab)

5.4 Cook Up a Data Mashup on the Fly with Infochimps

  • Data: Roulette
  • Tags: mash_up, data_api, api, data-in-the-cloud, data_in_the_cloud, mashup
  • Dhruv Bansal, Winnie Hsia (Infochimps)

http://www.infochimps.com/images/infochimps-logo-b.png

5.5 Turmeric : Bring some spice to the Service Oriented Architecture

Tags: bof

https://www.ebayopensource.org/uploads/Turmeric/TurmericLogo.jpg

from eBay

5.6 Extreme code reuse with GWT and Maven shade plugin

  • Moderated by: Curtis Lee Fulton

http://code.google.com/webtoolkit/images/gwt-logo.png
