My Strata Wednesday

Posted by haroldcarr on February 29, 2012 at 10:26 PM PST

Wednesday February 29, 2012

1 8:50am The Apache Hadoop Ecosystem Doug Cutting (Cloudera)

Context: exponential for decades

  • abundance of computing, storage, generated data
  • peta-scale now affordable
  • traditional data doesn't scale well
  • more data provides greater value
  • time for a new approach

New approach

  • traditional
    • exotic/expensive hardware (SAN, RAID)
  • big data
    • commodity/unreliable HW
    • reliability at software level
    • scales further

New software

  • traditional
    • monolithic
    • schema first
    • proprietary
  • big data
    • distributed (storage and compute)
    • raw data (optionally do schemas dynamically later)
    • open source


  • Hadoop is kernel (de facto industry standard)
  • around it : Pig, Hive, Flume, …


  • no strategic agenda - quality emergent
  • community-based - decisions by consensus/transparent
  • allows competing projects
  • loose federation of projects - permits evolution
  • ensures against vendor lock-in - can't buy Apache

Typical adoption pattern

  • idea that is impractical without Hadoop
  • build Hadoop-based proof of concept
  • move initial app to production
  • add more datasets and users
    • removing silos in organizations
    • permits easy experiments on real data
  • snowballs into institution's central repository
    • analysis
    • data processing


  • what data are you ignoring?
  • how can you use it?
  • how can you combine your data with others?

2 9:00am Do We Have The Tools We Need To Navigate The New World Of Data? Dave Campbell (Microsoft)

Microsoft now supports Hadoop (since last September)

data as a platform

         |             insights and action
value    |           knowledge
         |         information
         |      data
         |  signal

reduce the time to insight

         |             insights and action
value    |           knowledge
         |         information
         |      data
         |  signal
       /          refine
   / combine
reduce the time to insight

Azure Labs: Data Explorer


  • search/acquire
  • explore/analyze
  • explain/share

Powerview (will ship with new SQL Server)

3 9:10am Decoding the Great American ZIP myth Abhishek Mehta (Tresata)


  • William Levitt - father of the American suburb (Levittown)
  • Henry Ford - any color you want as long as it's black
  • but people inside are different

The tools to build a better financial system are here

  • can be used for other purposes too
  • commoditization of data stack is complete (and free/open source)

Hal Varian

unlimited storage, bandwidth, processing

  • what problem will you solve?

common platform to store, process, analyze, visualize data

(most of) the best minds of our generation are thinking about how to make people click ads …

4 9:20am Guns, Drugs and Oil: Attacking Big Problems with Big Data Mike Olson (Cloudera)



Single-Nucleotide Polymorphism

  • Tools: Bowtie and Crossbow (related SoapSNP, Contrail, CloudBurst)


  • Predictive Policing (Santa Cruz, CA: put cops where it matters)
    • imagine adding Twitter feeds, etc.
  • Entity Analysis
    • machine learning and social networking applied to drug trafficking and terrorism
    • who knows whom? what do they talk about?


  • reflection seismology
    • subsurface topography
    • data analysis/modeling to produce subsurface structure and reservoir maps

The ability to use data to solve important social problems.

5 9:30am Machine Learning and Big Data: Sustainable Value or Hype? Flavio Villanustre (LexisNexis Risk Solutions and HPCC Systems)

HPCC (LexisNexis' previously internal proprietary system)

  • open source distributed Big Data analytics platform

collection, ingest, discovery/cleanse, integration, analysis, delivery/visualization

how to extract value from data?

  • machine learning : ECL-ML: HPCC machine learning
  • correlation, classifiers, clustering, statistics, …

6 9:35am Learning Analytics: What Could You Do With Five Orders of Magnitude More Data About Learning? Steve Schoettler (Junyo)

EdTech (educational technology)

Immediate feedback is most important element in student achievement.

technology in classroom

  • apply big data to "click stream" from student to analyze and give feedback

7 9:40am A Big Data Imperative: Driving Big Action Avinash Kaushik (Market Motive)

Blog: occam's razor

how we use data will define us

data democracy: build tools so people can make data decisions directly (rather than placing "data intermediaries" between end users and data).

  • known knowns
  • known unknowns
  • unknown unknowns (biggest problem = e.g., Rumsfeld's excuse for screwing up)

math in the service of humanity

8 9:55am The Information Architecture of Medicine is Broken, Ben Goldacre (Bad Science)

9 10:40am Exploring Social Data: Use Cases for Real-World Application, Chris Moody (Gnip)

The decision that changed everything

  • twitter: Q3 2010
    • enabled commercial-grade $ access to their data (rather than just previous consumer APIs)
  • now others too
    • twitter, facebook, delicious, newsgator, youtube, g+, myspace, StockTwits, flickr, StumbleUpon, Dailymotion, WordPress, disqus, …

Reaction time

  • faster/slower
    • twitter, facebook, g+, youtube, wordpress, disqus/intensedebate


  • deep/concise
    • youtube, wordpress, Disqus, g+, facebook, twitter
deep    |                     Product Development
        | Customer Service
        |                    Brand Management
        |   Supply Chain
concise | PR
         faster         slower

Expected vs Unexpected Events

  • expected: ramp up, peak, ramp down
  • unexpected: spike, ramp down


ESRI and Local Retailing

  • what are people saying in store
  • how is inventory impacted by social data?
  • want to know what photos people are taking inside the store (e.g., dirty bathroom, bad/broken display)
  • when people steal they brag about it

NetBase, JD Power and Tropicana

  • sources: twitter, blogs, comments
  • characteristics: concise for signal, deep for insight
  • concise: baby boomers like orange juice
  • deep: Tropicana/orange juice associated with reward (in millennials)
  • action: put Tropicana vending machine outside gyms/health clubs

Industrial Parts Supplier

  • sources: twitter, blogs, comments
  • characteristics: coverage
  • where is the next factory/Walmart going to be built?
  • city council people tweeting, meeting minutes, …

VisionLink and Boulder fire

  • sources: twitter, flickr
  • characteristics: fast, geodata
  • new source of images/info about fire
  • gives responders a new view of where to focus attention

What is the right social cocktail to solve your business problem?

Mining a new source

  • Disqus
  • largest 3rd party commenting platform on web

10 11:30am Business Management Strategies for Big Data, Dave Rubin (Oracle)

runs NoSQL development team at Oracle

what is big data

  • velocity : high rates of incoming/temporal data
  • volume : vast quantities
  • variety : (un)structured data

common bigdata tech

  • NoSQL DB
    • dynamic/rapidly changing schema
    • predictable/bounded low latency store
  • MapReduce
    • break up problem into smaller sub-problems
    • Hadoop
  • HDFS - Hadoop Distributed File System
    • distributed, scalable storage
    • write once, read many times
  • Hive - query
  • HBase - non-relational DB like Google's Bigtable
  • R - language for statistical analysis
  • Pig - program MapReduce
  • Sqoop - SQL<->Hadoop
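
As a minimal sketch of the MapReduce idea in the list above (break up a problem into smaller sub-problems), here is a single-process word count in Python. It is not Hadoop's API; the names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is an independent sub-problem, so the map calls could run
    # on different machines; here they simply run in a loop.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["big data", "Big value"])` counts "big" twice and the other words once.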


  • predictive ad and content generation
  • data warehousing at facebook
    • hadoop/hive warehouse
    • 48000 cores
    • 12 TB per node, storage capacity of 5.5 PetaBytes
    • two level network topology
    • apps
      • reporting : measures of user engagement, MicroStrategy dashboards
      • ad hoc analysis
      • machine learning : predictive advertising
  • banking
    • hadoop as engineered shared service
    • lines of business using hadoop
    • two level usage
      • increase revenue: customer intelligence, sentiment analysis
      • reduce cost: fraud intelligence, risk mgmt, …

web is loaded with predictive signals

Recorded Future - selling big data as a service

  • predict trading volumes
  • predict returns from sentiment
  • predict volatility

oracle products

  • bigdata appliance
    • hadoop, NoSQL store, RDBMS<->hadoop loader
    • exalytics - fast analytics/visualization
    • R

stream -> big data appliance –(infiniband)-> exadata –(infiniband)-> exalytics

11 1:30pm Building a Data Narrative: Discovering Haight Street, Jesper Andersen (Bloom Studios)

data visualization design like game design

data narration (using statistics, programming, visualization, expression)

don't give users scores, give them stories

example: what is Haight Street like?

  • named after: wikipedia conflicts with SF municipal record

Voronoi tessellation
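
A Voronoi tessellation splits the plane into cells, one per site, where each cell holds the points closest to that site. A minimal sketch of that rule, assuming 2-D points as `(x, y)` tuples; the function name `nearest_site` is hypothetical and the talk's actual code was not shown:

```python
import math


def nearest_site(point, sites):
    """Return the index of the site whose Voronoi cell contains `point`."""
    return min(range(len(sites)), key=lambda i: math.dist(point, sites[i]))
```

For example, with sites at `(0, 0)`, `(10, 0)` and `(0, 10)`, the point `(1, 1)` falls in the cell of the first site.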

RapLeaf - name to gender service

Naive Bayes - predictor of positive/negative tweets
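
A rough sketch of how a Naive Bayes classifier could label text as positive or negative. This is a generic multinomial Naive Bayes with add-one smoothing, not the speaker's implementation; the class name and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of words seen with it
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage: call `train(text, "positive")` or `train(text, "negative")` on labeled tweets, then `predict(text)` on new ones.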

maximum entropy model to find what people are saying (with geodata)

instagram: what do people remember?

12 2:20pm Amazon DynamoDB: A seamlessly scalable NoSQL service, Swaminathan Sivasubramanian, (Amazon Web Services)

launched one month ago


  • app -> server -> DB
  • app -> load-balanced servers -> DB (keep making bigger box for single DB)
  • app -> load-balanced servers -> distributed DB

start with scale out as guiding principle

fully managed NoSQL data store

  • minimal admin
  • low latency SSDs
  • unlimited potential storage/throughput

provisioned throughput

  • how much read/write capacity
    • not in terms of servers and disk IO
  • increase/decrease any time
  • while app online


  • default : eventually consistent
  • choice: strongly consistent (cost 2x eventually)
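
To make the 2x cost concrete, a back-of-the-envelope sketch. It assumes one unit of provisioned read capacity serves one strongly consistent read per second or two eventually consistent reads per second, and it ignores item size; the function name `read_units_needed` is hypothetical.

```python
import math


def read_units_needed(reads_per_second, strongly_consistent=False):
    """Estimate read capacity units under the 2x assumption stated above."""
    if strongly_consistent:
        return reads_per_second
    # Eventually consistent reads cost half as much, rounded up.
    return math.ceil(reads_per_second / 2)
```

Under that assumption, 100 reads per second need 100 units when strongly consistent and 50 when eventually consistent.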


  • durable
    • all writes to disk (not memory)
    • acked when it exists in two physical data centers
  • availability
    • data replicated to multiple zones


  • specify primary key when create
    • hash on single attribute
    • or, composite
  • add more columns/rows any time
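
A toy illustration of the key model above, not DynamoDB's implementation or API. It assumes a composite key is a hash attribute plus a second attribute used for ordering (called `range_key` here); `NUM_PARTITIONS`, `partition_for`, and `ToyTable` are hypothetical names.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical; a real service manages this itself


def partition_for(hash_key):
    """Map a hash key to a partition number in a stable way."""
    digest = hashlib.md5(str(hash_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


class ToyTable:
    def __init__(self):
        self.partitions = [dict() for _ in range(NUM_PARTITIONS)]

    def put(self, hash_key, item, range_key=None):
        # Items are plain dicts, so each one may carry different attributes.
        bucket = self.partitions[partition_for(hash_key)]
        bucket.setdefault(hash_key, {})[range_key] = item

    def get(self, hash_key, range_key=None):
        bucket = self.partitions[partition_for(hash_key)]
        return bucket.get(hash_key, {}).get(range_key)

    def query(self, hash_key):
        """Return all items under one hash key, ordered by range key."""
        rows = self.partitions[partition_for(hash_key)].get(hash_key, {})
        return [rows[key] for key in sorted(rows)]
```

Because only the hash attribute decides the partition, all items sharing it live together and can be returned in range-key order.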


  • latency : single digit millisecond
  • SSD-backed
  • consistent as throughput/storage grows
  • no need to tune

Elastic Map Reduce

  • pay as you go Hadoop
  • Hive integration with EMR
  • filters pushed down to DB
  • built in table throughput aware query engine
  • use cases
    • archive
    • data load
    • complex queries


  • amazon cloud drive, elsevier, smugmug, amazon, shazam, formspring, tapjoy

13 4:00pm Roll Your Own Front End: A Survey of Creative Coding Frameworks, Michael Edgcumbe (Columbia University), Eric Mika (The Department of Objects)

WANCS : we are not computer scientists


  • Processing
    • built with Java
  • OpenFrameworks
    • built in C++
  • Raphael
  • pocode
    • built in C++
  • d3.js
    • toolbox
  • flash/air
  • cinder
    • in C++
  • PhiloGL
    • wrapper over


  • Dashboard vs Thumbprints
  • libraries
  • platform UI + Custom UI
  • novel IO


  • Data Sit
  • pentaho
  • tableau


  • oak ridge nat labs: flocking
  • movie fingerprints: cinemetrics

Stages of Viz

  • gather, clean, feed, group
  • access, iterate, calculate geometry, display, interact, update, repeat

above frameworks work with

  • linux, windows, mac, android, iOS
  • firefox, IE, chrome, safari

what is powering graphics

  • OpenGL (except flash)


  • best Processing, OpenFrameworks


  • cinder : if you know C++
  • know lots of stuff: openframeworks
  • don't know how to code : processing
  • canonical stuff to publish to web: d3.js
  • blending: processing

14 4:50pm Linked Data: Turning the Web into a Context Graph, Leigh Dodds, (Kasabi)


identity is hard

  • labeling
  • determining equivalence
  • e.g., street address vs lat/long

identifier used to "route" from one data set into another

use web (i.e., URI) for global identifiers
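
A small sketch of the routing idea: if two data sets key their records by the same URI, combining them is a lookup. The `example.org` URIs, the record fields, and the function name `link` are all made up for illustration.

```python
def link(dataset_a, dataset_b):
    """Merge records from two datasets that share the same URI key."""
    shared = dataset_a.keys() & dataset_b.keys()
    return {uri: {**dataset_a[uri], **dataset_b[uri]} for uri in shared}


# Hypothetical data: one set knows street addresses, the other lat/long.
addresses = {"http://example.org/place/1": {"address": "1 Example St"}}
coordinates = {
    "http://example.org/place/1": {"lat": 1.0, "lon": 2.0},
    "http://example.org/place/2": {"lat": 3.0, "lon": 4.0},
}
linked = link(addresses, coordinates)
```

Only `http://example.org/place/1` appears in both sets, so `linked` holds one record carrying the address together with the coordinates.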

free ebook: Linked Data Patterns Leigh Dodds, Ian Davis

15 5:30pm Expo Hall Reception

16 6:30pm Strata 2012 Startup Showcase


