OSCON Tuesday
Tuesday, 07/26/2011
Table of Contents
- 1 9:00am - Git Foundations
- 2 10:40am - Taming the Big Data Fire Hose
- 3 11:30am - Managing Thousands of Cloud Instances with Java
- 4 1:30pm - Google App Engine Workshop
- 5 2:20pm - Open Source Compiler Construction for the JVM
- 6 3:30pm - Using jQuery with Node.js
- 7 4:20pm - Lumberyard: Time Series Indexing at Scale
1 9:00am - Git Foundations
- Tools and Techniques
- Tags: master_class, version_control, source_code_control, github, open_source, git, dvcs, vcs
- Tim Berglund (August Technology Group, LLC), Matthew McCullough (Ambient Ideas, LLC)
attendance: 110
# List the current config
git config --global --list
# new config
git config --global user.name "foo.bar"
git config --global user.email "foo@bar.baz"
# see ~/.gitconfig
# can sync that file to other machines
# Mac/Linux: convert CRLF to LF on commit, no conversion on checkout
git config --global core.autocrlf input
# Windows: convert to CRLF on checkout
# and to LF on commit
git config --global core.autocrlf true
# editor
git config --global core.editor "emacs"
config levels (in precedence order - top highest):
- local : settings for a single repo (stored in its .git dir)
- global : in the user's home dir
- system : all users on the system
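The precedence order above can be sketched as a layered lookup. This is only a model of the rule, not how git reads its files; the three dictionaries and the function name are hypothetical stand-ins for the local, global and system config.

```python
# Hypothetical in-memory stand-ins for the three config levels.
SYSTEM = {"core.editor": "vi", "core.autocrlf": "false"}
GLOBAL = {"core.editor": "emacs", "user.name": "foo.bar"}
LOCAL = {"user.name": "project.bot"}


def effective_value(key, local=LOCAL, global_=GLOBAL, system=SYSTEM):
    """Return the value from the most specific level that sets the key."""
    for level in (local, global_, system):  # highest precedence first
        if key in level:
            return level[key]
    return None  # not set at any level


for key in ("user.name", "core.editor", "core.autocrlf"):
    print(key, "=", effective_value(key))
```

A local setting hides the global one, and a global setting hides the system one; a key set nowhere comes back as None.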
working        staging        repo
   <------------ checkout ------
   --- add --->
                  --- commit --->
# new git working directory
cd <somewhere>
git init project1
cd project1
git status
# add a file to repo
echo "foo" > bar.txt
git status
git add bar.txt
git status
git commit -m "commit comment"
# edit bar.txt, then:
git status
# "add" does not only mean "add a new file"; it stages any change: edit, move, delete, ...
git add *.txt
git status
git commit -m "update"
# edit bar.txt, then stage selected hunks interactively (see: content addressable file system)
git add -p bar.txt
# verbose commit (shows the diff while you write the commit message)
git commit -v
# skip the staging area: commit all modified tracked files directly
git commit -a
# view what is modified but not staged
git diff
# view what is staged but not committed
git diff --staged
# view what is modified or staged but not committed
git diff HEAD
# history
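git log
# files changed per commit
git log --stat
# full patch per commit
git log -p
# only commits that added files
git log --diff-filter=A
git log --pretty=raw
# last three commits
git log -3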
Use magit for emacs + git
ignored     untracked    tracked       tracked     tracked
                         unmodified    modified    staged
   <--ignore--
            ------------------- add ------------------>
                         ---- edit --->
                                       ---- add --->
                         <-- checkout --
# to ignore files: a .gitignore anywhere in the tree applies to its directory and below
# a leading ! means keep (do not ignore)
# a trailing / (<foo>/) means directory only
emacs .gitignore
*.log
*.tmp
target
output/
!special.log
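A rough sketch of how the patterns above combine, assuming only slash-free patterns like the ones in these notes. It is not git's matcher (no `**`, no anchored paths, no per-directory files); the function names are hypothetical. It models three rules: the last matching pattern wins, `!` re-includes, and a trailing `/` matches directories only.

```python
from fnmatch import fnmatchcase

PATTERNS = ["*.log", "*.tmp", "target", "output/", "!special.log"]


def _name_ignored(name, is_dir, patterns):
    """Apply every pattern to one path component; the last match wins."""
    ignored = False
    for raw in patterns:
        negate = raw.startswith("!")
        pattern = raw[1:] if negate else raw
        dir_only = pattern.endswith("/")
        pattern = pattern.rstrip("/")
        if dir_only and not is_dir:
            continue  # "output/" never matches a plain file
        if fnmatchcase(name, pattern):
            ignored = not negate
    return ignored


def is_ignored(path, patterns=PATTERNS):
    """path is a file path relative to the directory holding the .gitignore."""
    parts = path.split("/")
    for i, name in enumerate(parts):
        is_dir = i < len(parts) - 1  # every component but the last is a directory
        if _name_ignored(name, is_dir, patterns):
            return True  # an excluded directory hides everything below it
    return False
```

With these patterns `build.log` is ignored and `special.log` is kept, but `output/special.log` stays ignored because its parent directory is excluded.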
2 10:40am - Taming the Big Data Fire Hose
attendance: 25
Big data is defined by
- velocity
- volume
- variety
This talk is about velocity
- lots of independent things happening at high frequency
- want to update some state based on those events
- want to query that state in real time - usually pre-defined queries
- you want to record into persistent store after analysis
- usually on a budget
Use-case
- financial trades, telco calls, microtransactions, geo
                      dashboard
                          |
                          v
-> ... events -> velocity engine --cooked events--> analytic store/engine
                      (GB)                                (TB+)
Velocity engine
- validate
- respond
- count/aggregate
- enrich
Started as H-Store (a rethink of the RDBMS for the 21st century). VoltDB
commercializes it.
Workshop Tuesday at 5pm at Hilton.
Tables are partitioned. Partitions are put on machines. Stored procedure
calls are ordered and serialized to the machines.
Concurrency by scheduling, not locking.
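A toy sketch of "concurrency by scheduling, not locking", assuming a simple counting workload. It is not VoltDB's API; the class and method names are hypothetical. Each key is routed to exactly one partition, and each partition applies its queued updates one at a time, so no two updates ever touch the same state concurrently and no locks are needed.

```python
import zlib
from collections import deque


class PartitionedEngine:
    """Routes each event to one partition; partitions apply events serially."""

    def __init__(self, num_partitions):
        self.queues = [deque() for _ in range(num_partitions)]
        self.state = [{} for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash: the same key always lands on the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.queues)

    def submit(self, key, amount):
        self.queues[self.partition_for(key)].append((key, amount))

    def run(self):
        # A real system would give each partition its own thread or machine;
        # here each queue is simply drained in arrival order.
        for queue, state in zip(self.queues, self.state):
            while queue:
                key, amount = queue.popleft()
                state[key] = state.get(key, 0) + amount

    def total(self, key):
        return self.state[self.partition_for(key)].get(key, 0)


engine = PartitionedEngine(num_partitions=4)
for key, amount in [("alice", 5), ("bob", 2), ("alice", 3)]:
    engine.submit(key, amount)
engine.run()
print(engine.total("alice"), engine.total("bob"))
```

The design choice is that ordering is decided once, when the event is queued; the per-partition worker never has to coordinate with any other worker.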
3 11:30am - Managing Thousands of Cloud Instances with Java
- Java: Cloud
- Tags: cloud_computing, java, server, database
- Patrick Lightbody (wrote books -> jive/selenium -> gomez -> BrowserMob -> Neustar Webmetrics)
attendance: 25
example use case: load test: Spin up N browsers in cloud and hit your service
- AWS SDK for Java
- Typica
- supports EC2, SNS, SQS, SimpleDB, FPS, CloudWatch
- jclouds
- abstraction in front of lots of cloud vendors
- API based on Google Guice
Tips
- architect with pricing structure in mind
- e.g., no data transfer charge between EC2 and other AWS services in the same region
- minimize the number of machine images
- use user-data to self-configure
- understand how EBS volumes work
- use spot instances when you can
- pick smart inputs and boundaries for autoscaling
- use Twilio
- use IAM and consolidated billing
- boot faster with EBS-backed instances
- detect dead instances (e.g., telnet to port 22)
- the cloud is not infinite
- be a good citizen
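The "telnet to port 22" check from the tips above can be sketched as a plain TCP connect. The talk's own code is Java; this is just the same idea with the standard library, and the function name, host and timeout are assumptions.

```python
import socket


def port_is_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure, ...
        return False
```

An instance whose SSH port stops answering for several consecutive checks is a candidate for termination and replacement; a single failed check may only be a network blip.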
4 1:30pm - Google App Engine Workshop
- Cloud Computing
- Tags: chun, appengine, app_engine, core_python_, development, google_appengine, hosting, cloud, cloud_computing, computing, datastore, java, platform, nosql, scalability, business, enterprise, django, google, wesley_chun, google_app_engine, python
- Wesley Chun (Google)
attendance: 110
- SaaS
- Google docs/spreadsheet, netsuite, IBM LotusLive, SalesForce.com
- PaaS
- Rollbase, GAE, force.com, Azure
- IaaS
- rackspace, joyent, vmware, AWS
GAE
- build and test app locally
- upload to GAE
- GAE runs it - no need to worry about machines, network, storage, scalability, …
- "we wear pagers so you don't have to"
DIY hosting
- idle capacity, patches/upgrades, license, maintenance, traffic, …
Components
- scalable infrastructure
- Linux, GFS, Bigtable, Hardware
- language runtimes
- python, java (Scala, JRuby, Groovy, Rhino/JavaScript, Jython, Quercus/PHP), go
- java
- servlet (web app container), JDO/JPA (datastore API),
java.net.URL, javax.mail, javax.cache (memcache)
- web-based admin
- logs, quota, data store, billing, health
- SDK
- run locally, deploy, versioning, …
Users
- BestBuy, ebay, Forbes, SocialWok, BuddyPoke, gigya, webFilings, …
services/APIs
- Memcache
- Datastore
- URL Fetch
- XMPP
- Task Queue
- Images
- Blobstore
- Users Service
5 2:20pm - Open Source Compiler Construction for the JVM
- Java: JVM
- Tags: scala, programming, parser_combinators, jvm, compilers, bcel, java, apache
- Tom Lee (Shine Technologies)
attendance: 20
JVM
- stack-based architecture-independent VM
- impls: Oracle/HotSpot, Apache/Harmony, OpenJDK, …
Scala
- scala
- has own standard library
- runs on JVM (and .NET)
Apache BCEL
- Emit JVM bytecode via API
- could use other libraries besides BCEL
Compiler Architecture
- scanner : tokenizer
- parser : organizes tokens into Abstract Syntax Tree (AST)
- semantic checks
- code gen : traverse AST to produce target code
Parsing with Scala parser combinators
- combine small functions to describe a language in pseudo-EBNF
Example: calculator BNF
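A compact sketch of the scanner, parser and code gen stages for a calculator grammar. The talk used Scala parser combinators and BCEL; this swaps in a hand-written recursive-descent parser and a made-up stack instruction set, so every name here is hypothetical. The semantic-check stage is skipped. Assumed grammar: expr := term (('+'|'-') term)*, term := factor (('*'|'/') factor)*, factor := NUMBER | '(' expr ')'.

```python
import operator
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def scan(text):
    """Scanner: split source text into number and operator tokens."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens):
    """Parser: recursive descent producing a nested-tuple AST."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():
        tok = peek()
        if tok == "(":
            take()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            take()
            return node
        if tok is not None and tok.isdigit():
            return ("num", int(take()))
        raise SyntaxError(f"unexpected token {tok!r}")

    def term():
        node = factor()
        while peek() in ("*", "/"):
            node = (take(), node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            node = (take(), node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing token {peek()!r}")
    return tree


OPCODES = {"+": "add", "-": "sub", "*": "mul", "/": "div"}
FUNCS = {"add": operator.add, "sub": operator.sub,
         "mul": operator.mul, "div": operator.floordiv}


def generate(node):
    """Code gen: post-order walk of the AST emitting stack instructions."""
    if node[0] == "num":
        return [("push", node[1])]
    op, left, right = node
    return generate(left) + generate(right) + [(OPCODES[op],)]


def run(code):
    """Tiny stack machine standing in for the JVM."""
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(FUNCS[instr[0]](left, right))
    return stack.pop()


print(run(generate(parse(scan("1 + 2 * (3 - 1)")))))  # prints 5
```

Emitting operands before the operator is what makes the output fit a stack-based VM: each instruction finds its inputs on top of the stack.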
Uses
- Vim
- Apache Buildr (less horrible Maven)
…
6 3:30pm - Using jQuery with Node.js
- Node Day
- Elijah Insua (None)
attendance: 30
http://github.com/tmpvar/jsdom
use-case
- scraping web sites
Isn't jQuery a browser library?
- yes, but use jsdom 0.2.1 on the server
DOM
- tree of nodes representing a document, with an API to manipulate it
- level 1: foundation: document node, attribute, element, append/removeChild, getElementsByTagName
- level 2 core: namespaces, getElementById, getElementsByTagNameNS
- level 2 events: react to events; mutation events
- level 2 html: a, form, div, img, …
- level 3: normalize; compareDocumentPosition; get/setUserData; lookupNamespace
DOMWindow (aka window)
- global context for javascript
- location, self, frames, navigator, screen, getComputedStyle, …
jsdom.jQueryify or jsdom.env
Current impl has memory leak.
7 4:20pm - Lumberyard: Time Series Indexing at Scale
- Data: Analytics and Visualization, Data: Hadoop, Data: NoSQL Databases
- Tags: index, hadoop, data_scientists, hbase, nosql_nerd, timeseries, scaling_geek, search
- Josh Patterson (Cloudera)
attendance: 40
https://github.com/jpatanooga/Lumberyard
http://www.cloudera.com/blog/2011/04/simple-moving-average-secondary-sort-and-mapreduce-part-3/
Lumberyard
- time series iSAX indexing stored in HBase for persistent/scalable
index storage
Original Motivation
- 120 sensors, 30 samples/second = 4.3B/day
- http://openpdc.codeplex.com/
- needed to find "unbounded oscillations"
- found "SAX" by Keogh for time series data
Time Series Data : time stamp + floating point value
Speed at scale is the killer app
Unstructured data explosion
HBase: BigTable-like storage for Hadoop
- leverages HDFS as BigTable leveraged GFS
How to query time series with SQL?
iSAX and Time Series Data
- Indexable Symbolic Aggregate approXimation
- discretizes curves
- modifies SAX to allow extensible hashing and multi-level resolution representation
- similar to a b-tree
- nodes represent iSAX words
- internal nodes and leaf nodes
- leaf nodes fill up until reaching a threshold
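A minimal sketch of the SAX discretization that iSAX builds on, not Lumberyard's or jmotif's code. It assumes an alphabet of 4 symbols with the usual Gaussian breakpoints (-0.67, 0, 0.67); the function names are hypothetical. iSAX additionally lets each symbol carry its own cardinality, which is not shown here.

```python
import math
from bisect import bisect_right

# Breakpoints cutting a standard normal curve into 4 equally likely regions.
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"


def znormalize(series):
    """Rescale to mean 0 and standard deviation 1."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)  # flat series: nothing to scale
    return [(x - mean) / std for x in series]


def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    width = len(series) // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]


def sax_word(series, segments):
    """Discretize a curve into one symbol per segment."""
    return "".join(ALPHABET[bisect_right(BREAKPOINTS, value)]
                   for value in paa(znormalize(series), segments))


print(sax_word([0, 0, 4, 4, 6, 6, 10, 10], segments=4))  # prints abcd
```

Two series with the same word are candidates for being similar, which is what lets an index compare short words instead of scanning raw samples.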
100 million samples indexed, 1/2 TB of data
- linear scan: 1800 minutes
- exact iSAX: 90 minutes
- approximate iSAX: 1.1 seconds
http://code.google.com/p/jmotif
lumberyard
- jmotif implements core iSAX
- lumberyard implements storage backend in HBase
- index size can now scale up to TBs
For fast fuzzy query lookups that don't need an exact match
use-cases
- openPDC
- Genome Data as time series
- A, C, G, T
- images as time series
- convert shapes to 1D signals
- e.g., unroll perimeter into a line
- http://flumebase.org/
- http://opentsdb.net/
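For the images use-case above, one common way to unroll a perimeter into a line is the centroid-distance signature: walk the outline in order and record each point's distance from the shape's centre. The notes do not say which method the talk used, so this is an assumption, and the function name is hypothetical.

```python
import math


def perimeter_to_signal(points):
    """Turn an ordered list of (x, y) outline points into a 1D series."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [math.hypot(x - cx, y - cy) for x, y in points]


# Corners and edge midpoints of a 2x2 square, walked counter-clockwise.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
print(perimeter_to_signal(square))
```

The resulting series rises at corners and dips at edge midpoints, so it can be indexed like any other time series.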





