
My Strata Thursday

Posted by haroldcarr on March 1, 2012 at 8:30 PM PST

Thursday March 1, 2012

1 8:50am Democratization of Data Platforms, Jonathan Gosier (metaLayer Inc.)

2 9:05am 5 Big Questions about Big Data, Luke Lonergan (Greenplum, a division of EMC)

Big data opens the door to new approaches to engaging customers and making decisions.

Data that was formerly discarded is now mined.

Scale everything: storage, compute, analytics, interaction.

3 9:15am The Trouble with Taste, Coco Krumme (MIT Media Lab)

wine tasting experiment: social influence, content & value, expectation & perception

wine aroma wheel versus mass spectrometer versus human nose

4 9:25am Embrace the Chaos, Pete Warden (Jetpac)

hard to access structured data (e.g., Amazon's product catalog) in the wild (unless you pay for it)

what was previously data "exhaust" (i.e., unstructured data) is now valuable (and generally free or very low cost)

ask yourself: what would Google do?

look for patterns

"forgive" human errors, noise, poor grammar/spelling, …

what data sources do you have? (e.g., support emails, invoices, tweets)

5 9:35am Open Data and the Internet of Things, Usman Haque (

machine to machine

  • gas company meter reading to compute bill (purpose specific)
  • closed

internet of things

  • light bulbs, printers, cars, people, smart phones, appliances
  • data open to be used in other contexts
    • e.g., cell phone to tower info : tomtom traffic patterns

crowd-sourced data - e.g., radiation feeds

Internet of Things Bill of Rights (2011)

6 9:45am Big Data’s Next Step: Applications, Gary Lang (MarkLogic)

  • 2001 - MarkLogic founded - queries against (un)structured data in repository
  • 2003 - XML in CMS with search
  • 2006 - Hadoop
  • 2007 - poly-structured
  • 2011 - BigData (same thing MarkLogic has always done)
  • any data, volume, structure
  • analyze everything all the time in real-time
  • keep all data in commodity store and slurp into hadoop when needed

7 9:50am Dr. Richard Merkin, President and CEO of Heritage Provider Network, Announces the Winner of the Second Heritage Health Progress Prize, Richard Merkin (Heritage Provider Network)

healthcare data mining to improve people's lives

8 Start-up showcase winners

judges' winners

  • Tokutek - MySQL speed, scalability, agility
  • Lex Machina - data from patent litigation to interpret and make predictions
  • memsql - accelerated data
  • bitdeli - …

audience winner

  • metaLayer - drag and drop discovery
    • delv - take real-time data streams and mashup

9 9:55am Using Google Data for Short-term Economic Forecasting, Hal Varian (Google)

10 10:40am Mining the Eventbrite Social Graph for Recommending Events, Vipul Sharma (Eventbrite)



  • search - solr
  • recommendation - hadoop, native MapReduce, bash
  • persistence - MySQL, HDFS, HBase, MongoDB (investigating Cassandra and Riak)
  • stream - RabbitMQ (investigating Kafka)
  • offline - MapReduce, Streaming, Hive, Hue

Infrastructure - Sqoozie

  • workflow for mysql imports to HDFS
    • generate sqoop commands, run imports in parallel
  • transparent to schema changes
  • include/exclude column, data types, table
  • data type casting
  • distributed table imports
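
A rough sketch of what "generate sqoop commands, run imports in parallel" could look like. This is not Eventbrite's Sqoozie code: the table specs, JDBC URL, and function names are all made up for illustration. Only the `sqoop import` flags (`--connect`, `--table`, `--target-dir`, `--columns`, `--map-column-java`) are real Sqoop options. The default runner is a dry run that just renders each command line.

```python
import shlex
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table specs: names, columns, and casts are made up for illustration.
TABLES = {
    "orders": {
        "all_columns": ["id", "event_id", "user_id", "card_number"],
        "exclude": ["card_number"],
        "casts": {"id": "Long"},
    },
    "events": {"all_columns": ["id", "title"]},
}


def build_sqoop_command(jdbc_url, table, target_root,
                        all_columns=(), exclude=(), casts=None):
    """Build the argument list for one `sqoop import` of a single table."""
    cmd = ["sqoop", "import", "--connect", jdbc_url,
           "--table", table, "--target-dir", f"{target_root}/{table}"]
    # Include/exclude columns: keep everything that is not explicitly excluded.
    columns = [c for c in all_columns if c not in exclude]
    if columns:
        cmd += ["--columns", ",".join(columns)]
    # Data type casting: override the Java type Sqoop would pick for a column.
    if casts:
        mapping = ",".join(f"{col}={typ}" for col, typ in sorted(casts.items()))
        cmd += ["--map-column-java", mapping]
    return cmd


def run_imports(commands, runner=shlex.join, workers=4):
    """Hand each command to `runner` in parallel threads.

    The default runner is a dry run that only renders the shell line;
    pass e.g. subprocess.run to actually execute the imports.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))


commands = [build_sqoop_command("jdbc:mysql://db.example/shop", name, "/imports", **spec)
            for name, spec in TABLES.items()]
for line in run_imports(commands):
    print(line)
```

To stay "transparent to schema changes", the column list would presumably be read from the database on every run rather than hard-coded as it is here.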

Infrastructure - Blammo

  • raw logs import to HDFS via flume
  • 5 min latency
  • logs are key/value in json
  • each log publishes schema in yaml

Recommendation system

recommendation engines

  • item hierarchy - you bought camera, need batteries
  • collaborative - people who bought camera also bought…
  • collaborative item-item similarity - you like Godfather so would like …
  • social graph based - your friends liked …
  • interest graph based - your friends who like rock music like you are attending …
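
To make the "item-item similarity" flavor concrete, here is a minimal sketch using cosine similarity over who liked what. The data and function names are hypothetical, and the talk did not say which similarity measure any particular engine uses; cosine is just a common choice.

```python
from math import sqrt

# Hypothetical data: item -> set of users who liked it.
LIKES = {
    "Godfather": {"ann", "bob", "cat"},
    "Goodfellas": {"ann", "bob"},
    "Toy Story": {"dan"},
}


def cosine(a, b):
    """Cosine similarity of two binary like-vectors, each given as a set of users."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def similar_items(item, likes, top_n=3):
    """Rank the other items by how much their audience overlaps with `item`."""
    scores = [(other, cosine(likes[item], users))
              for other, users in likes.items() if other != item]
    scores = [(other, s) for other, s in scores if s > 0]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


print(similar_items("Godfather", LIKES))
```

Items with no shared audience are dropped, so "Toy Story" never shows up as a recommendation for "Godfather" here.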

why interest?

  • events are social
  • interests are changing
  • dense graph is irrelevant (need segmentation)

how to know interest?

  • ask you
  • based on activity (attended, browsed)
  • facebook
  • machine learning
    • logistic regression using MLE
    • sparse matrix generated via MapReduce
    • model for each interest
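
A toy version of "logistic regression using MLE" with one model per interest, assuming sparse rows stored as feature dicts. Feature names, labels, and hyperparameters are invented for illustration. The talk built the sparse matrix with MapReduce; this sketch just fits in memory with stochastic gradient ascent on the log-likelihood.

```python
from math import exp


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + exp(-z))


def predict(weights, row):
    """Probability that the user described by the sparse `row` has the interest."""
    return sigmoid(sum(weights.get(f, 0.0) * v for f, v in row.items()))


def train(rows, labels, epochs=200, lr=0.5):
    """Maximize the log-likelihood by stochastic gradient ascent.

    For one example the gradient with respect to weight f is (y - p) * x_f.
    """
    weights = {}
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            p = predict(weights, row)
            for f, v in row.items():
                weights[f] = weights.get(f, 0.0) + lr * (y - p) * v
    return weights


# Hypothetical sparse activity rows (only non-zero features are stored).
ROWS = [
    {"bias": 1, "attended:rock": 1},
    {"bias": 1, "attended:rock": 1, "browsed:jazz": 1},
    {"bias": 1, "browsed:jazz": 1},
    {"bias": 1},
]
# One label vector per interest, hence one model per interest.
LABELS = {"rock": [1, 1, 0, 0], "jazz": [0, 1, 1, 0]}

models = {interest: train(ROWS, ys) for interest, ys in LABELS.items()}
print(predict(models["rock"], {"bias": 1, "attended:rock": 1}))
```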


  • model based vs clustering
  • item-item vs user-user
  • building social graph is clustering step
  • social graph recommendation is a ranking problem

implicit social graph

u = user
e = event

             /  \
            e1   \
           /      \
          u2       u3
         /  \
       e2    e3
       /      \
     u4        u5
  • mixed features
  • series of map-reduce jobs
  • output on HDFS in flat files; input to subsequent jobs
  • orders - event -> attendees
    • map eid: uid
    • reduce eid[uid]
  • attendees -> social graph
    • input eid[uid]
    • map uid[uid]
    • reduce …
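
The two jobs above can be sketched in memory like this. The `run_job` helper is a stand-in for a real MapReduce run, and the order data is made up. The notes elide what the second reduce does, so counting shared events per neighbor is my assumption, not necessarily what Eventbrite computes.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical orders: (user id, event id).
ORDERS = [("u1", "e1"), ("u2", "e1"), ("u2", "e2"),
          ("u4", "e2"), ("u2", "e3"), ("u5", "e3")]


def run_job(records, mapper, reducer):
    """In-memory stand-in for one MapReduce job: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: orders -> event -> attendees
def map_order(order):
    uid, eid = order
    yield eid, uid  # map eid: uid


def reduce_attendees(eid, uids):
    return eid, sorted(set(uids))  # reduce eid[uid]


# Job 2: attendees -> social graph
def map_pairs(event):
    _eid, uids = event  # input eid[uid]
    for a, b in combinations(uids, 2):
        yield a, b  # map uid[uid], emitted in both directions
        yield b, a


def reduce_neighbors(uid, others):
    # Assumption: weight each neighbor by the number of events shared.
    return uid, dict(Counter(others))


attendees = run_job(ORDERS, map_order, reduce_attendees)
graph = dict(run_job(attendees, map_pairs, reduce_neighbors))
print(graph)
```

The output of the first job is the input of the second, which mirrors the "output on HDFS in flat files; input to subsequent jobs" chaining.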

Hbase (single source of truth)

  • collect data from multiple MR jobs
    • stores entire social graph
    • over one million writes per second
rowid  neighbors  events  featureX
271    101        3       0.367879

tips and tricks

  • distributed cache as much as possible
    • sped up some MR jobs by hours
    • be sure to use counters
  • hive (use as much as possible)
    • "flip join" - join + processing/transformation
    • statistical functions using hive
    • UDF
  • memory memory memory
  • LZO, WAL
  • combiners are great until
  • shuffle and sorting stage
  • hadoop ecosystem is still new
    • optimal level of spill on disk vs jvm memory
    • significant amount of time doing debugging of hadoop itself

(best talk so far - nitty gritty details)

11 11:30am Pretty Simple Data Privacy, Kaitlin Thaney (Digital Science), Betsy Masiello (Google), John Wilbanks (Kauffman Foundation for Entrepreneurship)


  • designers making it easy to give away your data (especially from mobile apps)
  • privacy is about context, social situations and control
  • users are getting pissed
  • simplifying understanding of privacy for users (e.g., iconic representations)
  • genome sequencing : openSNP
  • 23andMe
  • design barriers to donate your data to science
  • Consent to Research - John Wilbanks' effort


  • make research more efficient vis-à-vis privacy
  • opt-in service A conflicting with opt-out service B
  • de-anonymizers


Google's new privacy policy (in effect starting today: March 1)

  • notification effort started Jan 24 - and home page promos
  • previously 70 different policies - now 1 (but still has 6 other ones - e.g., google wallet)
  • treat you as single user across all products (HC: even if you don't want it)
  • to avoid: don't log in when doing search, etc (HC: so I have to logout of gmail before doing a search)
  • use "Privacy tools"
    • dashboard, ads preferences mgr, data liberation front, opt-outs, encrypted search


12 1:30pm Data Jujitsu: The Art of Turning Data into Product, DJ Patil (Greylock Partners)

what is a data product (facilitates end goal thru use of data)?

philosophy for data

  • Jujitsu: the art of softness - defeat armed opponents without using weapons (use their energy)
  • use light-weight data to try things out (instead of gigantic design in advance)

build data products as a progression

  • what data do you start with: (un)structured; can you switch that ratio
  • cleanup : un -> structured
  • disambiguate by asking user (move hard backend problem to easy frontend problem)
  • human augmentation is key
  • build easy products first
  • be opportunistic for wins - e.g., people you may know
  • put yourself into the role of a physical surrounding (e.g., physical retail space)
  • giving back data is driver - e.g., who's viewed your profile, viewers by geography
  • data vomit is bad - too much info causes click-through-rate (CTR) to drop off
  • exposing data challenges - e.g., type-casting user
  • set user expectations - set your users up for success (e.g., pandora)
  • hard to test outside of production - need humans to look
  • have to win within 500ms
  • know when to build the serious stuff
    • e.g., post a job and get people recommended ("pandora for people")

13 2:20pm Connecting Millions of Mobile Devices to the Cloud, James Phillips (Couchbase, Inc.)

  • schema evolution
  • scalability of consolidated store
  • selection performance
  • bandwidth conservation
  • referential integrity on interruption
  • battery and memory conservation
  • delete propagation
  • which data - temporal, spatial, user?
  • new user provisioning
  • conflict detection and resolution

multi-tier, scalable type 2 sync architecture:

no sql ; search
data synchronization
load balancer
web app

data infrastructure

external data
big data > analysis
            no sql > web app
            sync > mobile

14 4:00pm Open Source Ceph Storage – Scaling from Gigabytes to Exabytes with Intelligent Nodes, Sage Weil (new dream network / dreamhost)


  • requirements, save time/money


  • diverse needs
    • object storage, block devices for snapshots/cloning, shared file, structured data
  • scale
    • heterogeneous hardware, reliability, fault tolerance


  • ease of admin
  • no manual data migration, load balancing
  • painless scaling (up/down)


  • low cost per gigabyte
  • no vendor lock-in
    • software solution on commodity hardware
    • open source

what is ceph?

unified storage system : distributed storage system : data center scale, fault tolerant, commodity hardware

  • objects (big/small)
  • block devices
  • distributed file system


  • LGPLv2
  • no dual licensing

storage devices coordinate among themselves (i.e., the intelligence is in the nodes) so clients do not have to

data distribution

  • objects replicated N times
  • auto placed, balanced, migrated
  • smart about physical infrastructure (e.g., no duplication on same rack)


  • pseudo-random placement algorithm
  • rules: e.g., 3 replicas, same row, different racks
  • predictable bounded migration
  • map update (e.g., new nodes, failure, downgrade) potentially triggers data migration
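
To get a feel for pseudo-random placement with rack rules and bounded migration, here is a small sketch. It is not Ceph's actual placement algorithm; it swaps in plain rendezvous (highest-random-weight) hashing with a one-replica-per-rack rule, and the node and rack names are hypothetical.

```python
import hashlib


def score(obj, node):
    """Deterministic pseudo-random weight for an (object, node) pair."""
    digest = hashlib.sha256(f"{obj}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj, cluster, replicas=3):
    """Pick `replicas` nodes for `obj`, at most one per rack.

    `cluster` maps node name -> rack name. Any client holding the same map
    computes the same answer, so no central lookup table is needed.
    """
    ranked = sorted(cluster, key=lambda node: score(obj, node), reverse=True)
    chosen, used_racks = [], set()
    for node in ranked:
        if cluster[node] not in used_racks:
            chosen.append(node)
            used_racks.add(cluster[node])
        if len(chosen) == replicas:
            break
    return chosen


# Hypothetical cluster map: six storage nodes across three racks.
CLUSTER = {"osd0": "rack-a", "osd1": "rack-a", "osd2": "rack-b",
           "osd3": "rack-b", "osd4": "rack-c", "osd5": "rack-c"}
OBJECTS = [f"obj-{i}" for i in range(200)]

before = {o: place(o, CLUSTER) for o in OBJECTS}
# Map update: a new node in a new rack joins the cluster.
after = {o: place(o, dict(CLUSTER, osd6="rack-d")) for o in OBJECTS}
moved = [o for o in OBJECTS if before[o] != after[o]]
print(f"{len(moved)} of {len(OBJECTS)} objects changed placement")
```

In this sketch the only objects whose placement changes are the ones that pick up the new node, which is the kind of bounded migration the talk described.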

distributed file system

  • cluster-coherent
  • separate metadata and data paths
  • dynamic subtree partitioning
  • move work from busy to idle servers

best tool for the job


  • cassandra, riak, redis

object store

  • rados (ceph)


  • gfs, hdfs


  • ceph, lustre, gluster


  • tablets, logs


  • data flows


  • triggers, transactions


  • limited options for scalable open source storage
    • orangefs, lustre, glusterfs, HDFS
  • proprietary
    • hardware+ software
  • industry needs to change


  • created at UC Santa Cruz (2007) as his PhD
    • thesis : (pdf)
  • supported by DreamHost 2008-2011
  • new company 2012
  • growing community
  • they are hiring: C/C++/Python, sysadmins, testing engineers

15 4:50pm Mapping social media networks (with no coding) using NodeXL, Marc Smith (Social Media Research Foundation)


  • firefox of GraphML
  • make network charts as easy as making a pie chart
  • connect researchers to social media sources

Open Tools

Open Data

Open Scholarship

social media is all about connections from people to people and exchanges between them

patterns are left behind

like, link, reply, rate, review, favorite, friend, follow, forward, edit, tag, comment, check-in, …

the strength of weak ties (in aggregate)

social networks: 1934: Jacob L. Moreno

  • sociogram of a football team
  • look for hubs
  • bridge : in what way is a person with only two connections more "important" than one with hundreds?
  • clusters : sub-communities
  • crowds :
  • isolates :
  • Gephi - photoshop for graphs
  • NodeXL - like MSPaint for graphs
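
The talk was about doing this without code, but the hub/bridge/isolate vocabulary is easy to pin down in a few lines. A small sketch on a made-up friendship graph; "bridge" is interpreted here as a person whose removal splits the network into more pieces (a cut vertex), which is my reading of the notes, not a definition given in the talk.

```python
# Hypothetical network: two tight triangles joined only through dan, plus a loner.
PEOPLE = ["ann", "bob", "cat", "dan", "eve", "fay", "gus", "hal"]
EDGES = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("eve", "fay"), ("eve", "gus"), ("fay", "gus"),
         ("ann", "dan"), ("dan", "eve")]


def build(people, edges):
    graph = {p: set() for p in people}
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def components(graph, skip=None):
    """Count connected components, optionally pretending `skip` was removed."""
    seen, count = {skip}, 0
    for start in graph:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node] - seen)
    return count


graph = build(PEOPLE, EDGES)
degree = {p: len(friends) for p, friends in graph.items()}
top = max(degree.values())
hubs = sorted(p for p, d in degree.items() if d == top)
isolates = sorted(p for p, d in degree.items() if d == 0)
baseline = components(graph)
bridges = sorted(p for p in graph if components(graph, skip=p) > baseline)
print(hubs, isolates, bridges)
```

Here dan has the same two connections as bob, but only dan holds the two clusters together, which is one answer to the "more important" question above.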

Social Network Theory

article: Visualizing the signatures of social roles in online discussion groups

  • taxonomy of types of people

