OSCON Tuesday

Posted by haroldcarr on July 26, 2011 at 5:37 PM PDT

Tuesday, 07/26/2011


1 9:00am - Git Foundations

  • Tools and Techniques
  • Tags: master_class, version_control, source_code_control, github, open_source, git, dvcs, vcs
  • Tim Berglund (August Technology Group, LLC), Matthew McCullough
    (Ambient Ideas, LLC)

attendance : 110

http://git-scm.com/images/header.gif

# List the current config
git config --global --list

# new config
git config --global user.name "foo.bar"
git config --global user.email "foo@bar.baz"

# see ~/.gitconfig
# can sync that file to other machines

# normalize line endings to LF on commit, leave files untouched on checkout (Mac/Linux)
git config --global core.autocrlf input

# on Windows: convert to CRLF on checkout
# and back to LF on commit
git config --global core.autocrlf true

# editor
git config --global core.editor "emacs"

config levels (in precedence order, highest first):

  • local : config a setting in a .git repo
  • global : in the user's home dir
  • system : all users on the system

working    staging    repo
    <- checkout --------
    -add---->
              -commit-->

# new git working directory
cd <somewhere>
git init project1
cd project1
git status

# add a file to repo
echo "foo" > bar.txt
git status
git add bar.txt
git status
git commit -m "commit comment"

# edit bar.txt, then:
git status
# "add" does not mean "add a new file"; it stages any change: edit, move, delete, ...
git add *.txt
git status
git commit -m "update"

# edit bar.txt, then stage selected hunks interactively (see: content addressable file system)
git add -p bar.txt
# verbose commit (shows the diff in the commit message editor)
git commit -v
# skip the staging area: commit all modified tracked files directly
git commit -a

# view what is modified but not staged
git diff
# view what is staged but not committed
git diff --staged
# view what is modified or staged but not committed
git diff HEAD

# history
git log
git log --stat
git log -p
git log --diff-filter=A
git log --pretty=raw
git log -3

Use magit for emacs + git

ignored  untracked  tracked  tracked  tracked
                    unmod    modified staged
  <-ignore
             ----------add------------->
                        --edit-->
                                --add-->
                      <--- checkout-----

# to ignore files: a .gitignore anywhere in the tree applies to its directory and below
# a leading ! negates a pattern (keep the file)
# a trailing / matches directories only
emacs .gitignore

*.log
*.tmp
target
output/
!special.log
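
To make the pattern rules above concrete, here is a rough Python sketch of how those five patterns could be evaluated. It is a deliberate simplification and not git's actual matcher: it only handles basename globs, the leading `!`, and the trailing `/`, and the function names are made up for illustration.

```python
from fnmatch import fnmatchcase

# The patterns from the notes above.
PATTERNS = ["*.log", "*.tmp", "target", "output/", "!special.log"]


def _last_match(name, patterns, is_dir):
    """Return True if the last pattern matching `name` is an ignore rule."""
    ignored = False
    for pattern in patterns:
        negate = pattern.startswith("!")
        body = pattern[1:] if negate else pattern
        if body.endswith("/"):
            if not is_dir:
                continue  # directory-only pattern never matches a file
            body = body[:-1]
        if fnmatchcase(name, body):
            ignored = not negate  # later patterns override earlier ones
    return ignored


def is_ignored(path, patterns=PATTERNS):
    """Simplified check for a relative file path such as 'src/app.log'."""
    *dirs, name = path.split("/")
    # A file inside an ignored directory cannot be re-included by "!".
    for directory in dirs:
        if _last_match(directory, patterns, is_dir=True):
            return True
    return _last_match(name, patterns, is_dir=False)
```

With these patterns, `is_ignored("build.log")` is true while `is_ignored("special.log")` is false, because the later `!special.log` rule overrides `*.log`.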

2 10:40am - Taming the Big Data Fire Hose

  • Data: Real-Time and Streaming
  • Tags: real_time_traveler
  • John Hugg (VoltDB)

attendance: 25

http://voltdb.com/sites/all/themes/voltdb/logo.png

Big data is defined by:

  • velocity
  • volume
  • variety

This talk is about velocity:

  • lot of independent things happening at high frequency
  • want to update some state based on those events
  • want to query that state in real time - usually pre-defined queries
  • you want to record into persistent store after analysis
  • usually on a budget

Use-case

  • financial trades, telco calls, micro-transactions, geo

                  dashboard
                     |
                     v
-> ... events -> velocity --cooked events--> analytic store/engine
                  engine                           (TB+)
                   (GB)

Velocity engine

  • validate
  • respond
  • count/aggregate
  • enrich

It started with H-Store, a research project to rethink the RDBMS for
the 21st century. VoltDB is the commercial version.

Workshop Tuesday at 5pm at Hilton.

Tables are partitioned, and the partitions are placed on machines.
Stored procedure calls are ordered and serialized to those machines.

Concurrency by scheduling, not locking.
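
A toy Python sketch of "concurrency by scheduling, not locking" as described above: each partition owns its rows and drains its own queue one procedure at a time, so nothing needs a lock. This is only an illustration of the idea, not VoltDB's implementation or API; `Partition`, `Cluster`, and `count_call` are hypothetical names.

```python
import zlib
from collections import deque


class Partition:
    """One slice of the data plus a queue of work, drained by a single thread."""

    def __init__(self):
        self.rows = {}
        self.queue = deque()

    def run_pending(self):
        # Procedures run one at a time, in arrival order, so no locks are needed.
        results = []
        while self.queue:
            procedure, args = self.queue.popleft()
            results.append(procedure(self.rows, *args))
        return results


class Cluster:
    def __init__(self, partition_count):
        self.partitions = [Partition() for _ in range(partition_count)]

    def submit(self, key, procedure, *args):
        # Route by a stable hash of the partitioning key.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].queue.append((procedure, (key,) + args))
        return index


def count_call(rows, caller, seconds):
    """Hypothetical stored procedure: aggregate call count and duration per caller."""
    calls, total = rows.get(caller, (0, 0))
    rows[caller] = (calls + 1, total + seconds)
    return rows[caller]
```

All events for the same key land on the same partition, so the count/aggregate step never races with another writer.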


3 11:30am - Managing Thousands of Cloud Instances with Java

  • Java: Cloud
  • Tags: cloud_computing, java, server, database
  • Patrick Lightbody (wrote books -> jive/selenium -> gomez -> BrowserMob -> Neustar Webmetrics)

attendance: 25

http://www.jclouds.org/_/rsrc/1263492028549/config/app/images/customLogo/customLogo.gif

example use case: load test: Spin up N browsers in cloud and hit your service

  • AWS SDK for Java
  • Typica
    • supports EC2, SNS, SQS, SimpleDB, FPS, CloudWatch
  • jclouds
    • abstraction in front of lots of cloud vendors
    • API based on Google Guice

Tips

  • architect with pricing structure in mind
    • e.g., no data transfer charge between EC2 and other AWS in same region
  • minimize the number of machine images
  • use user-data to self-configure
  • understand how EBS volumes work
  • use spot instances when you can
  • pick smart inputs and boundaries for autoscaling
    • use Twilio
  • use IAM and consolidated billing
  • boot faster with EBS-backed instances
  • detect dead instances
    • telnet to port 22
  • the cloud is not infinite
  • be a good citizen
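
The "telnet to port 22" tip for detecting dead instances can be automated. Below is a small Python sketch using only the standard library (the talk itself was about Java tooling, so this is a stand-in, and `is_alive` is a made-up helper name). It only checks that something accepts TCP connections on the SSH port, which is a weak but cheap liveness signal.

```python
import socket


def is_alive(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A monitor could call `is_alive(address)` for each instance on a schedule and replace any instance that fails several checks in a row.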

4 1:30pm - Google App Engine Workshop

  • Cloud Computing
  • Tags: chun, appengine, app_engine, core_python_, development, google_appengine, hosting, cloud, cloud_computing, computing, datastore, java, platform, nosql, scalability, business, enterprise, django, google, wesley_chun, google_app_engine, python
  • Wesley Chun (Google)

attendance: 110

https://code.google.com/appengine/images/appengine_lowres.png

  • SaaS
    • Google docs/spreadsheet, netsuite, IBM LotusLive, SalesForce.com
  • PaaS
    • Rollbase, GAE, force.com, Azure
  • IaaS
    • rackspace, joyent, vmware, AWS

GAE

  • build and test app locally
  • upload to GAE
  • GAE runs it - no need to worry about machines, network, storage, scalability, …
  • "we wear pagers so you don't have to"

DIY hosting

  • idle capacity, patches/upgrades, license, maintenance, traffic, …

Components

  • scalable infrastructure
    • Linux, GFS, Bigtable, Hardware
  • language runtimes
    • python, java (Scala, JRuby, Groovy, Rhino/JavaScript, Jython, Quercus/PHP), go
    • java
      • servlet (web app container), JDO/JPA (datastore API),
        java.net.URL, javax.mail, javax.cache (memcache)
  • web-based admin
    • logs, quota, data store, billing, health
  • SDK
    • run locally, deploy, versioning, …

Users

  • BestBuy, ebay, Forbes, SocialWok, BuddyPoke, gigya, webFilings, …

services/APIs

  • Memcache
  • Datastore
  • URL Fetch
  • Mail
  • XMPP
  • Task Queue
  • Images
  • Blobstore
  • Users Service

5 2:20pm - Open Source Compiler Construction for the JVM

attendance: 20

JVM

  • stack-based architecture-independent VM
  • impls: Oracle/HotSpot, Apache/Harmony, OpenJDK, …

Scala

  • scala
  • has own standard library
  • runs on JVM (and .NET)

Apache BCEL

  • Emit JVM bytecode via API
  • could use other libraries besides BCEL

Compiler Architecture

  • scanner : tokenizer
  • parser : organizes tokens into Abstract Syntax Tree (AST)
  • semantic checks
  • code gen : traverse AST to produce target code

Parsing with Scala parser combinators

  • combine small functions to describe a language in pseudo-EBNF

Example: calculator BNF
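
The talk's example was in Scala and is not reproduced in these notes. As a stand-in, here is a rough Python sketch of the same idea: a handful of small combinator functions (`token`, `seq`, `alt`, `many`, all names of my own choosing) are composed so that the parser definition reads almost like the grammar. For brevity it evaluates while parsing instead of building an AST and generating bytecode.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def token(pattern):
    """Match a regex after optional whitespace; return (text, new_pos) or None."""
    regex = re.compile(r"\s*(" + pattern + ")")

    def parse(text, pos):
        m = regex.match(text, pos)
        return (m.group(1), m.end()) if m else None
    return parse


def seq(*parsers):
    """All parsers must match in order; yields the list of their values."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def alt(*parsers):
    """First parser that matches wins."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse


def many(parser):
    """Zero or more repetitions."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parse


def apply(parser, fn):
    """Transform a parser's value with fn."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None
        value, pos = result
        return fn(value), pos
    return parse


def fold(value):
    """Combine [first, [[op, operand], ...]] left to right."""
    first, rest = value
    for op, operand in rest:
        first = OPS[op](first, operand)
    return first


# expr   ::= term   { ("+" | "-") term }
# term   ::= factor { ("*" | "/") factor }
# factor ::= number | "(" expr ")"
number = apply(token(r"\d+(?:\.\d+)?"), float)
parens = seq(token(r"\("), lambda t, p: expr(t, p), token(r"\)"))
factor = alt(number, apply(parens, lambda v: v[1]))
term = apply(seq(factor, many(seq(token(r"[*/]"), factor))), fold)
expr = apply(seq(term, many(seq(token(r"[+\-]"), term))), fold)


def evaluate(source):
    result = expr(source, 0)
    if result is None or source[result[1]:].strip():
        raise ValueError("syntax error in %r" % source)
    return result[0]
```

A real compiler would have `apply` build AST nodes instead of calling `fold`, then walk the tree in a separate code-generation pass.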

Uses

  • VIM
  • Apache Buildr (less horrible maven)


6 3:30pm - Using jQuery with Node.js

  • Node Day
  • Elijah Insua (None)

attendance: 30

http://static.jquery.com/files/rocker/images/logo_jquery_215x53.gif

http://github.com/tmpvar/jsdom

use-case

  • scraping web sites

Isn't jQuery a browser library?

  • yes, but jsdom 0.2.1 lets you use it on the server

DOM

  • tree of nodes for representing and manipulating a document
  • level 1: foundation: document node, attribute, element, append/removeChild, getElementsByTagName
  • level 2 core: namespaces, getElementById, getElementsByTagNameNS
  • level 2 events: react to events; mutation events
  • level 2 html: a, form, div, img, …
  • level 3: normalize; compareDocumentPosition; get/setUserData, lookupNamespace

DOMWindow (aka window)

  • global context for javascript
  • location, self, frames, navigator, screen, getComputedStyle, …

jsdom.jQueryify or jsdom.env

Current impl has memory leak.


7 4:20pm - Lumberyard: Time Series Indexing at Scale

  • Data: Analytics and Visualization, Data: Hadoop, Data: NoSQL Databases
  • Tags: index, hadoop, data_scientists, hbase, nosql_nerd, timeseries, scaling_geek, search
  • Josh Patterson (Cloudera)

attendance: 40

https://github.com/jpatanooga/Lumberyard

http://www.cloudera.com/blog/2011/04/simple-moving-average-secondary-sort-and-mapreduce-part-3/

Lumberyard

  • time series iSAX indexing stored in HBase for persistent, scalable
    index storage

Original Motivation

  • 120 sensors, 30 samples/second, 4.3B samples/day
  • http://openpdc.codeplex.com/
  • needed to find "unbounded oscillations"
  • found "SAX" by Keogh for time series data

Time Series Data : time stamp + floating point value

Speed at scale is the killer app

Unstructured data explosion

HBase: BigTable-like storage for Hadoop

  • leverages HDFS as BigTable leveraged GFS

How to query time series with SQL?

iSAX and Time Series Data

  • Indexable Symbolic Aggregate approXimation
  • discretizes curves
  • Modifies SAX to allow extensible hashing and multi-level resolution
    representation
  • similar to a b-tree
    • nodes represent iSAX words
    • internal nodes and leaf nodes
    • leaf nodes fill up until reaching a threshold
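
To show what "discretizes curves" means, here is a minimal Python sketch of plain SAX, the step iSAX builds on: z-normalize the series, average it into fixed-size segments (PAA), then map each average to a letter using breakpoints that split a standard normal distribution into equal-probability regions. It assumes an alphabet of four letters and a series length divisible by the segment count, and it does not cover iSAX's variable cardinality or the jmotif/Lumberyard APIs.

```python
import bisect
import math

# Quartile breakpoints of the standard normal distribution (alphabet size 4).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"


def znormalize(series):
    """Rescale to zero mean and unit variance (flat series become all zeros)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in series]


def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    size = len(series) // segments  # assumes len(series) % segments == 0
    return [sum(series[i * size:(i + 1) * size]) / size
            for i in range(segments)]


def sax_word(series, segments):
    """Turn a numeric series into a short string of symbols."""
    averages = paa(znormalize(series), segments)
    return "".join(ALPHABET[bisect.bisect_left(BREAKPOINTS, v)]
                   for v in averages)
```

For example, `sax_word([0, 0, 0, 0, 10, 10, 10, 10], 2)` gives `"ad"`: a low segment followed by a high one. Words like this are what the index tree's nodes represent.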

100 million samples indexed, 1/2 TB of data

  • linear scan: 1800 minutes
  • exact iSAX: 90 minutes
  • approximate iSAX: 1.1 seconds

http://code.google.com/p/jmotif

lumberyard

  • jmotif implements core iSAX
  • lumberyard implements storage backend in HBase
  • index size can now scale up to TBs

For fast fuzzy query lookups that don't need an exact match

use-cases
