My Strata Tuesday

Posted by haroldcarr on February 28, 2012 at 8:12 PM PST

Tuesday February 28, 2012


1 9:00am Large Scale Web Mining, Ken Krugler, Scale Unlimited

ken@scaleunlimited.com

@kkrugler

slides

Large scale means you need a distributed processing framework.

  • crawl - find the good stuff
    • still need to focus crawl
    • ethics - don't hit machines needlessly
  • extract - get right stuff out
    • conflicting constraints: scale vs. precision
    • pages are noisy : ads, boilerplate (navigation), SEO
  • process - turn bytes into bucks
    • reduction - i.e., pie charts
    • index - enable search
    • analytics - cluster, recommend, …

Web crawling

  • fetch pages
  • extract outlinks (means parse fetched content)
  • manage state of crawl
  • implicit rules
    • robots exclusion protocol (robots.txt)
    • user agent (who you are, how to contact you)
    • request rate
  • broad (e.g., google)
  • focused
    • page scoring -> outlinks; perhaps whitelist of domains to avoid traps/honeypots
    • whitelist: use top sites listed by Alexa or Quantcast
  • domain
    • limit to certain domains; for precise extraction
  • don't crawl
    • public datasets: leverage other people's crawl data
    • e.g., common crawl, wikipedia (data dump), Spinner, InfoChimps
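The implicit rules above (robots.txt, user agent, request rate) can be checked with Python's standard `urllib.robotparser`. This is only a minimal sketch, not something shown in the talk: the robots.txt content, the crawler name `example-crawler`, and the contact details are all made up for illustration. A real crawler would download robots.txt from each host before fetching anything else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical crawler identity: the short name is matched against
# robots.txt, the full header tells webmasters who you are and how
# to contact you.
AGENT_NAME = "example-crawler"
USER_AGENT_HEADER = "example-crawler (+http://example.com/crawler; crawler@example.com)"


def make_parser(robots_txt):
    """Build a parser from robots.txt text that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url):
    """True if robots.txt lets our crawler fetch this URL."""
    return parser.can_fetch(AGENT_NAME, url)


def delay_seconds(parser, default=1.0):
    """Seconds to wait between requests to the same host."""
    delay = parser.crawl_delay(AGENT_NAME)
    return float(delay) if delay is not None else default
```

The one-second default delay is an arbitrary assumption for hosts that do not declare a Crawl-delay.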

Crawling solutions

Crawling hard

  • mining breaks implicit contract with sites
    • you are generally not creating an index that drives traffic to them
    • you are using their bandwidth and server cycles
  • "infinite" web means you will run into edge cases
  • not everybody plays nice: link farms/honeypots, malicious sites, angry webmasters
    • use whitelist to avoid
  • risk : can do (perceived) damage to sites (and get sued)
  • work at scale
    • Hadoop - Nutch, Bixo
    • custom queuing - Heritrix, Droids
    • scalable queuing - Storm

Focused crawling

  • only crawl pages likely to be good
  • seed URLs - starting point: put into URL state
  • URL state - DB of all known URLs
  • page score - "quality" of page
  • link score - page score divided by the number of outlinks
  • fetched pages - saved results

Ethical crawling

  • send real, valid, info in user agent name
    • contact info so they can send you email/phone,…
  • honor robots.txt
  • limit your crawl rate
  • comply with blacklisting and data removal requests
  • follow ethical guidelines
  • avoid javascript
  • don't follow form links
  • don't do CAPTCHA
  • grovel when they complain - they are right

1 seed
  -> URL state
    -> 2 sort
      -> 3 focus
        -> 4 fetch and parse
          -> 5 save fetched
            -> 6 page score
              -> 7 extract outlinks
                -> 8 score outlinks
  -> URL state
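
The numbered flow above can be sketched as a small loop. This is an illustration under stated assumptions, not the speaker's implementation: `FAKE_WEB`, `GOOD_TERMS`, the scoring function, and the thresholds are all hypothetical, and an in-memory dict stands in for real fetching so the example runs offline. The link score follows the notes: page score divided by the number of outlinks.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outlinks).
FAKE_WEB = {
    "http://a.example/": ("hadoop crawl data", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("hadoop data mining", ["http://d.example/"]),
    "http://c.example/": ("buy cheap pills", ["http://e.example/"]),
    "http://d.example/": ("data data data", []),
    "http://e.example/": ("pills pills", []),
}
GOOD_TERMS = {"hadoop", "data", "crawl", "mining"}  # assumed, manually picked


def page_score(text):
    """Toy page score: fraction of words that are 'good' terms."""
    words = text.split()
    return sum(w in GOOD_TERMS for w in words) / len(words) if words else 0.0


def focused_crawl(seeds, min_link_score=0.1, max_pages=10):
    url_state = {url: 1.0 for url in seeds}   # 1 seed -> URL state (all known URLs)
    queue = [(-1.0, url) for url in seeds]    # negated scores give a max-heap
    heapq.heapify(queue)
    fetched = {}                              # saved results: URL -> page score
    while queue and len(fetched) < max_pages:
        neg_score, url = heapq.heappop(queue)           # 2 sort: best link first
        if url in fetched or -neg_score < min_link_score:
            continue                                    # 3 focus: skip weak links
        if url not in FAKE_WEB:
            continue
        text, outlinks = FAKE_WEB[url]                  # 4 fetch and parse
        score = page_score(text)                        # 6 page score
        fetched[url] = score                            # 5 save fetched
        if not outlinks:
            continue
        link_score = score / len(outlinks)              # 8 score outlinks
        for link in outlinks:                           # 7 extract outlinks
            if link_score > url_state.get(link, -1.0):
                url_state[link] = link_score            # back into URL state
                heapq.heappush(queue, (-link_score, link))
    return fetched
```

Starting from `http://a.example/`, the low-scoring page `c` is still fetched once, but its outlink `e` inherits a link score of zero and is never fetched, which is the point of focusing the crawl.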

Seed URLs (listed broad to narrow)

  • list of registered domains
  • DMOZ - lots of spam/porn
  • Alexa/Quantcast "top sites" list
  • Wikipedia
  • Tweets - with filtering (e.g., Gnip, DataSift)
  • search
    • manually enter URLs - slow, but curated
    • use API, faster, limited, can have junk

Scoring pages

  • analyze text (tokenize)
  • term-based: count occurrences of all phrases, good phrases (manually picked), and bad phrases
    • calculate ratios of counts: good/all - bad/all = score
  • SVM : support vector machine
    • train with documents
    • creates statistical model
      • divides training docs into separate classes
      • used to give an unknown doc a class
  • deciding "good"
    • minimum threshold for amount of real content (versus graphics, cruft, boilerplate, navigation, SEO, ads,…)
      • use Boilerpipe and other cleaners
    • detect link farms with fake content

Expand crawl frontier

  • normalize links
    • link lengthening: expand shortened links (e.g., bitly)
    • www.(*)/
  • skipping links to low-value pages
    • suffix filtering: images, pdf, binary types
    • DB generated pages
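
Link normalization and suffix filtering can be sketched with `urllib.parse`. The specific rules here are assumptions for illustration: dropping a leading `www.`, trimming trailing slashes, discarding fragments, the suffix list, and treating URLs with more than two query parameters as DB generated. Expanding shortened links needs a network request and is left out.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit

# Assumed list of low-value suffixes: images, PDFs, binaries.
SKIP_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip", ".exe")
MAX_QUERY_PARAMS = 2  # assumed cutoff for "DB generated" pages


def normalize(url):
    """Map equivalent URLs to one canonical form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]                      # www.example.com == example.com
    if parts.port:
        host = f"{host}:{parts.port}"        # keep an explicit port
    path = parts.path.rstrip("/") or "/"     # /a/ == /a
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def worth_fetching(url):
    """False for links that are likely low-value pages."""
    parts = urlsplit(url)
    if parts.path.lower().endswith(SKIP_SUFFIXES):
        return False
    if len(parse_qsl(parts.query)) > MAX_QUERY_PARAMS:
        return False
    return True
```

Normalizing before the URL state lookup keeps the crawler from fetching the same page under several spellings.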

Focused domain crawl

  • one domain
  • generally discovery of target content pages
  • often uses URL patterns to synthesize links
  • two phases
    • crawl : discovery of details pages
    • fetch/process : details pages
  • keep track of page type in URL state DB

Extract

  • characteristics
    • broad - lots of domains/formats
    • precise - very specific types of data
    • accurate - low error rate
  • types
    • unstructured - broad/accurate
    • semi-structured - broad/precise
    • structured - precise/accurate
  • HTML -> XHTML (TagSoup, NekoHTML, HtmlCleaner)
  • detect Charset of page and turn bytes into characters (Tika, ICU)
  • link extraction
  • remove boilerplate
  • language (e.g., tokenizing, audience)
    • HTTP response: Content-Language: es
    • HTML meta tag :
    • HTML tag attributes
    • analyze: ngram statistics, short words, …
  • Unstructured extraction from HTML
    • title, description from meta
  • Semi-structured
    • patterns: phone numbers, dates
    • microformats
    • NLP: named entities
  • Structured
    • xpath (maybe regex)
    • use firebug - will show xpath for each element
    • beware: some browsers will rewrite HTML
    • DOM often generated with JavaScript
  • dealing with JavaScript
    • need to execute: HtmlUnit, qt-webkit, headless Mozilla
    • 10x slower
    • loads server, skews site's statistics (makes webmaster angry)
    • pages work in FF/IE but not HtmlUnit (uses Rhino)
    • pages cause HtmlUnit to hang
    • often a parallel page for web crawlers (look for site map)

Resources


2 1:30pm Hands-on Visualization with Tableau, Jock Mackinlay, Ross Perez, Tableau Software

Jock:

Data story telling / conversations

[images: poster_OrigMinard.gif, photo51.jpg, many-eyes-blog.jpg]

Process

  • task that involves data
  • forage for data
  • search for a visual form
  • translate to visual form
  • develop insight
  • act / repeat

Data modeling

  • data cleaning : ETL

Ross:

http://bit.ly/z9CWCp


3 5:30pm Get connected

meetups

  • meetup.com/Data-Mining
  • meetup.com/VisualizeMyData

Companies

  • Kaggle
    • run data mining competitions
    • hiring data scientists, developers, product manager, …
    • mission: make data scientists more highly valued based on their models
  • Cloud Physics
    • hiring big data expert
    • funding from tier one venture firm
  • Yummly
    • recipe web site
    • hiring analytics, back-end, front-end, …
  • Netflix
    • hiring data science / data engineering, product management, …
  • Edmodo
    • social learning for K-12
    • in/out of class to develop, deliver, test curriculum
    • measure teacher performance
    • free secure platform, no ads
    • hiring: data scientists
  • GfK
    • global market research
    • have lots of data sets
    • hiring data visualization
  • Uber
    • fancy taxi in SF
    • hiring data team
  • Wealthfront
    • online financial adviser
    • how to help people interact with investments?
    • how to engage in making a good decision with little capital
    • hiring : engage with large proprietary data sets, explain finance, lead designer, engineers
  • Accretive Health
    • healthcare technology : 1/2 billion insurance claim sets
    • hiring data miners
  • Disqus
    • commenting system / discussion network
    • quality of comments, discovery, monetization
    • hiring data team
  • Huawei
    • telecom
    • hiring big data
  • Tango
    • mobile video calling
    • hiring data architect, scientist, analyst, engineer
  • Amazon
    • hiring engineering, system engs, PM, data scientists, …
  • StudyBreak
    • social event discovery and recommendation
    • hiring data scientist
  • Mendeley
    • archive of research documents
    • hiring Java/Hadoop/AWS data scientist/engineers
  • New York Times
    • OpenPaths
    • hiring data scientist, linked data scientist
  • Data Without Borders
    • connect data scientists with social organizations to serve humanity
    • "hiring" data volunteers, managers
  • Trulia
    • real estate search engine
    • hiring data mining (home price predictions), data scientist
  • Cloudera
    • Hadoop
    • hiring all engineering roles, data scientist/visualization
  • Flurry
    • mobile analytics for apps
    • hiring data science, visualization, machine learning
  • Ad Mobius
    • hiring machine learning, time series, NLP
  • TenTenData
    • analytical database engine in cloud
    • trillion row web-based spreadsheet
    • hiring …
  • (stealth - no name)
    • mobile cloud intersection with security
    • hiring analytics

4 7:00pm Strata Mini Maker Faire & Data Crush

cheap wine and gadgets…
