
Book review: Mining the Social Web by Matthew A. Russell

Posted by bhaktimehta on February 16, 2014 at 11:34 PM PST

There is so much potential in extracting, processing, and synthesizing the multifaceted, real-time data that can be mined from social sites.
Mining the Social Web by Matthew A. Russell covers these topics scrupulously. Here is my detailed review of the book.

Chapter 1: Mining Twitter, Exploring Trending topics, Discovering what users are talking about

This is an informative chapter that starts from the basics: creating applications with Twitter, authorizing an application to access Twitter data, looking for trends, searching for tweets, and extracting the text, screen names, and hashtags from the tweets. It also covers how to compute the lexical diversity of the tweets and how to visualize the data with histograms. It covers matplotlib, prettytable, and other Python packages. I have used the Twitter APIs extensively and found this chapter very useful and well written.
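As a rough illustration of lexical diversity, here is a minimal sketch that treats it as the ratio of unique tokens to total tokens. The sample tweets and the whitespace tokenization are my own assumptions for illustration, not the book's code or data.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample tweets (not real Twitter data)
tweets = [
    "Mining the social web is fun",
    "The social web is full of data",
]

# Naive tokenization: lowercase and split on whitespace
words = [w.lower() for t in tweets for w in t.split()]
diversity = lexical_diversity(words)
print(f"{len(set(words))} unique words out of {len(words)}: {diversity:.3f}")
```

A value close to 1.0 means almost every word is distinct, while a low value means the same words are repeated often.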

Chapter 2 : Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More

This chapter covers the developer tools for Facebook, the Graph API explorer, using the API over HTTP, the Open Graph Protocol, examining friendships, and analyzing social graph connections. It demonstrates how to use the facebook-sdk package to make FQL queries. Other examples include computing overlapping likes in a social network, analyzing mutual friendships, and visualizing the results with D3.js.
This chapter makes it apparent that there are many exciting possibilities for what can be done with social data, and that there’s enormous value tucked away on Facebook which can be extracted and studied.
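To give a flavor of the "overlapping likes" idea, here is a small sketch that counts how many friends share each like. It assumes the likes have already been fetched into a plain dictionary; the names and data are hypothetical and no Facebook API call is made.

```python
from collections import Counter

# Hypothetical data: friend name -> set of liked page names
likes_by_friend = {
    "alice": {"Python", "Coffee", "Hiking"},
    "bob": {"Python", "Coffee"},
    "carol": {"Coffee", "Jazz"},
}


def common_likes(likes_by_friend, min_count=2):
    """Return likes shared by at least min_count friends, with their counts."""
    counts = Counter(
        like for likes in likes_by_friend.values() for like in likes
    )
    return {like: n for like, n in counts.items() if n >= min_count}


shared = common_likes(likes_by_friend)

# Pairwise overlap is a plain set intersection
alice_and_bob = likes_by_friend["alice"] & likes_by_friend["bob"]
```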

Chapter 3 : Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More

This chapter covers how to use the LinkedIn developer API and retrieve LinkedIn connections. It also covers data normalization and similarity computation techniques, visualizing locations with cartograms, and clustering algorithms. I think this was one of the most detailed and informative chapters of the book and I thoroughly enjoyed reading it.
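Normalization and similarity fit together naturally, so here is a minimal sketch of both for job titles. It uses Jaccard similarity, which is one common set-based metric; the abbreviation table and the sample titles are assumptions of mine for illustration.

```python
# Hypothetical abbreviation table used for normalization
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "vp": "vice president"}


def normalize(title):
    """Lowercase a job title, drop punctuation, and expand abbreviations."""
    tokens = title.lower().replace(".", " ").replace(",", " ").split()
    expanded = " ".join(ABBREVIATIONS.get(t, t) for t in tokens)
    return expanded.split()


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


same = jaccard(normalize("Sr. Software Engineer"),
               normalize("Senior Software Engineer"))
partial = jaccard(normalize("Software Engineer"),
                  normalize("Senior Software Engineer"))
different = jaccard(normalize("Chief Technology Officer"),
                    normalize("Senior Software Engineer"))
```

Without the normalization step, "Sr." and "Senior" would count as different tokens and the first pair would look less similar than it really is.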

Chapter 4 : Mining Google+: Computing Document Similarity, Extracting Collocations, and More

This chapter gives an introduction to the Google+ Developer API. It covers TF-IDF, which stands for term frequency-inverse document frequency, as well as algorithms to find similar documents, visualizing document similarities with matrix diagrams, contingency tables, and scoring functions. Most importantly, this chapter covers many techniques for extracting information from unstructured text with information retrieval techniques, in meticulous detail.
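For readers new to TF-IDF, here is a minimal pure-Python sketch. It uses one simple variant (term frequency times the natural log of the inverse document frequency); other variants add smoothing or normalization, and the book's own implementation may differ. The three tiny documents are hypothetical.

```python
import math


def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to term."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


def rank(query, docs):
    """Score each document against the query terms, best match first."""
    names = list(docs)
    corpus = [docs[name].lower().split() for name in names]
    scores = {
        name: sum(tf_idf(t, tokens, corpus) for t in query.lower().split())
        for name, tokens in zip(names, corpus)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical mini-corpus
docs = {
    "a": "the social web is rich in data",
    "b": "the web of data",
    "c": "the quick brown fox",
}
ranking = rank("web data", docs)
```

Note that a word such as "the", which appears in every document, gets an IDF of zero and therefore contributes nothing to any score.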

Chapter 5 : Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More

Chapter 5 discusses context-driven techniques and goes into the semantics of human language data. It covers the Boilerpipe library to extract text from a web page, feedparser to extract text from RSS feeds, web crawling techniques, end-of-sentence (EOS) detection, tokenization, part-of-speech tagging, and chunking. It shows how to use the Natural Language Toolkit (NLTK) to extract entities from unstructured text.
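The first two steps of that pipeline, EOS detection and tokenization, can be approximated with regular expressions. This is a deliberately naive sketch of my own, swapped in for NLTK so that it runs with the standard library only; it will mis-split abbreviations such as "Dr. Smith", which is exactly the kind of case a trained tokenizer handles better.

```python
import re


def split_sentences(text):
    """Naive EOS detection: split after . ! or ? when followed by
    whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Split into words (keeping simple contractions) and punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


# Hypothetical sample text
text = "Mining the web is fun. It isn't always easy! Is it worth it?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```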

Chapter 6 : Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More

Chapter 6 covers the basic semantics of mbox and mail headers, how to convert mbox to JSON and import it into MongoDB, querying by date and time, and analyzing patterns in sending and receiving messages using the Enron data. It also covers how to analyze your own mail data and access Gmail data with OAuth. This chapter builds on the previous ones and covers a lot, from basic mbox handling to mining mboxes to slicing and dicing the data.
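The mbox-to-JSON step can be sketched with Python's standard `mailbox` and `json` modules. This is my own minimal version, not the book's script: the sample message, the file name, and the chosen fields are assumptions, it only handles simple single-part messages, and the MongoDB import is left out.

```python
import json
import mailbox
import os
import tempfile
from email.message import EmailMessage


def mbox_to_records(path):
    """Read an mbox file and return a list of JSON-friendly dicts."""
    records = []
    box = mailbox.mbox(path)
    try:
        for msg in box:
            records.append({
                "From": msg["From"],
                "To": msg["To"],
                "Subject": msg["Subject"],
                "Date": msg["Date"],
                # For multipart mail get_payload() returns a list of parts;
                # this sketch assumes plain single-part messages.
                "body": msg.get_payload(),
            })
    finally:
        box.close()
    return records


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")  # hypothetical file name

    # Build a one-message mbox so the example is self-contained
    sample = EmailMessage()
    sample["From"] = "alice@example.com"
    sample["To"] = "bob@example.com"
    sample["Subject"] = "Lunch"
    sample["Date"] = "Mon, 03 Feb 2014 10:00:00 -0800"
    sample.set_content("See you at noon.")

    box = mailbox.mbox(path)
    box.add(sample)
    box.flush()
    box.close()

    records = mbox_to_records(path)
    as_json = json.dumps(records, indent=2)
```

Once the messages are plain dictionaries, they can be loaded into a document store or filtered directly in Python.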

Chapter 7 : Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More

This chapter covers GitHub’s developer platform and how to make API requests. It includes an example that uses PyGithub to make queries. Additionally, it covers the concept of an interest graph, how to construct one from GitHub data, and how to model property graphs with NetworkX. I found the sections on getting the repositories from a graph, finding the programming languages for each user, and the GitHub Archive very educational.
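To show what an interest graph with properties looks like, here is a tiny sketch built on plain dictionaries rather than NetworkX, so that it runs without extra packages. The class, the edge types ("gazes", "programs"), and the sample users and repositories are all hypothetical.

```python
class InterestGraph:
    """Minimal directed property graph: nodes and edges carry dict properties."""

    def __init__(self):
        self.nodes = {}   # node name -> properties
        self.edges = {}   # source -> {destination -> properties}

    def add_node(self, name, **props):
        self.nodes.setdefault(name, {}).update(props)

    def add_edge(self, src, dst, **props):
        self.add_node(src)
        self.add_node(dst)
        self.edges.setdefault(src, {})[dst] = props

    def neighbors(self, src, edge_type=None):
        """Destinations reachable from src, optionally filtered by edge type."""
        return sorted(
            dst for dst, props in self.edges.get(src, {}).items()
            if edge_type is None or props.get("type") == edge_type
        )


def languages_for(graph, user):
    """Languages of every repository the user has shown interest in."""
    langs = set()
    for repo in graph.neighbors(user, edge_type="gazes"):
        langs.update(graph.neighbors(repo, edge_type="programs"))
    return sorted(langs)


g = InterestGraph()
g.add_node("alice", type="user")
g.add_edge("alice", "repo-a", type="gazes")
g.add_edge("alice", "repo-b", type="gazes")
g.add_edge("repo-a", "Python", type="programs")
g.add_edge("repo-b", "Go", type="programs")
g.add_edge("bob", "repo-a", type="gazes")

alice_languages = languages_for(g, "alice")
```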

Chapter 8 : Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More

This chapter talks about microformats, with examples that use Google's Data testing tool to access semantic markup from web pages such as LinkedIn profiles, Wikipedia, and About.com. Additionally, there is an interlude on the semantic web and the evolution of the web, with its various manifestations and characteristics, which was very interesting.

Part 2 Twitter Cookbook

This is a whole section dedicated to recipes for mining Twitter's accessible APIs, chosen for their openness and growing popularity. I have done a few sentiment analysis projects with the Twitter API and found this section really helpful for people who want to dig deeper into the API. It presents recipes in the form of problems, solutions, and discussions, on topics ranging from OAuth to finding popular trends to searching tweets. I particularly liked the recipes for using the Twitter Streaming API to sample data from the Twitter firehose and for saving and accessing Twitter feeds with MongoDB.

Additional Observations

  • I like the author's use of additional online resources for further reference as well as the recommended exercises section at the end of each chapter.
  • I enjoyed the author's style of introducing each social website, then the API to fetch the data, then the techniques and their complexities. Every chapter had some handy tips and tools to learn about. Additionally, every chapter built upon ideas and discussions covered in the previous ones, which tied the concepts together neatly.
  • I think the code samples based on Python and the myriad of packages introduced in the various chapters help make data access, manipulation and visualization easy to follow.

Conclusion

I thoroughly enjoyed reading and reviewing "Mining the Social Web".
This is a great book for exploring the rich data that can be extracted from Twitter, Facebook, LinkedIn, Google+, and GitHub. It has detailed information for everyone, whether an Information Retrieval (IR) enthusiast, a data scientist, an analyst, or just a curious reader looking to learn about the different APIs available to extract, mine, visualize, and interpret the data, and to explore the many possibilities that come from using these diverse building blocks together.