
The Need to Feed

Posted by rsmudge on July 25, 2008 at 8:36 AM PDT

I'm reading Programming Collective Intelligence by Toby Segaran. Excellent book. It is a very practical introduction to machine learning and data mining techniques. All the examples are in Python, which isn't a real problem. However, I am eager to try them out in Java, which means I am hunting for Java equivalents of the Python libraries used in the book.

Currently, I'm looking for a decent "universal" feed parser. I'm actually planning to apply this to a real project. Something forgiving of broken XML is important.

Here is what I've found so far:

Jakarta Commons Feed Parser

I had so much hope for this library. It is SAX-based and forgiving of broken XML. It depends on six other external libraries, which means I get to download and organize my jar files. This is a great way to feel productive when I want to avoid organizing my sock drawer. Except there is no download link, and the SVN link is broken as well. All sarcasm aside, I think this would be my first choice if I could get to it.

Eddie RSS and Atom Parser for Java

From the page: "Eddie is a liberal RSS and Atom feed parsing library for Java. It is a SAX based parser and as a result is capable of parsing a significant number of broken feeds. It was written after discovering that the well-known ROME feed parser is implemented using DOM and therefore incapable of dealing with ill formed XML. It also failed to parse some well formed feeds too." -- When I get a chance I plan to look into this one. It's my second choice for now.

Rome

Of course there is the popular Rome feed parser. I like what I see on the java.net webpage. A minimal external library dependency!! The only thing I fear is broken feeds. I assume if it's that big of a deal they have probably done something about it by now. This one uses a DOM model and this article looks like a helpful start.

Fast forward to the future (now). I have done my best to play with these parsers. Here is what happened:

Jakarta Commons Feed Parser

I snooped around the SVN directories and the site even more hoping to find some way to get the magic nectar that is this code. No dice and I was out of luck.

Eddie RSS and Atom Parser for Java

I tried to compile this. I ended up downloading many of the libraries it needed and trying to compile them. Where possible I tried downloading binaries of the libraries. I then became frustrated and downloaded a binary. I then realized the site had no javadoc on it and I'd have to generate my own. I know I don't need the libs to generate the javadoc but by this point I was done with Eddie. (This post doesn't have a happy ending btw).

Rome

I did get Rome and the Rome Feed Fetcher Library working. Rome scores massive developer-friendly points for having a build file that worked right off the bat. Type 'ant' and go!!! For the Fetcher I had to create my own build file. Then I ran into lots of fun loading the whole thing into my scripting language (apparently Feed Fetcher crashes if it can't find a properties file in the system classpath). Overall I was happy with Rome and it worked pretty well, except it lost information when parsing the (probably incorrectly formatted) feed for http://sleep.dashnine.org/.

What's a code hacker to do?

So I decided to go with a Python solution. I downloaded the Universal Feed Parser library and made a Python script to dump my downloaded feeds to a file so I could process them from my Java code. Everything worked without a hitch.
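For anyone wanting to try the same approach, here is a minimal sketch of what such a dump script could look like. It is not the original script. The output format (one JSON object per line), the choice of fields, and the function names `entry_to_record`, `write_records` and `dump_feed` are all assumptions made for illustration, and it is written for Python 3. The only Universal Feed Parser calls it relies on are `feedparser.parse()`, the `entries` list, and the `bozo` flag.

```python
import json
import sys


def entry_to_record(entry):
    """Reduce one parsed entry to a plain dict (field choice is an assumption)."""
    # Parsed entries behave like dicts, so .get() tolerates fields
    # that a sloppy feed leaves out.
    return {
        "title": entry.get("title", ""),
        "link": entry.get("link", ""),
        "summary": entry.get("summary", ""),
    }


def write_records(entries, out):
    """Write one JSON object per line; returns the number of entries written."""
    count = 0
    for entry in entries:
        # json.dumps escapes embedded newlines, so each record stays on one line.
        out.write(json.dumps(entry_to_record(entry)) + "\n")
        count += 1
    return count


def dump_feed(source, out_path):
    """Parse a feed (URL, file path, or raw XML string) and dump it to out_path."""
    # Imported here so the helpers above work without the third-party library.
    import feedparser

    parsed = feedparser.parse(source)
    if parsed.bozo:
        # bozo is set when the feed was not well formed; entries may still be usable.
        sys.stderr.write("warning: %s is not well formed\n" % source)
    with open(out_path, "w", encoding="utf-8") as out:
        return write_records(parsed.entries, out)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python dump_feed.py <feed-url-or-file> <output-file>
    print("wrote %d entries" % dump_feed(sys.argv[1], sys.argv[2]))
```

One record per line keeps the Java side simple: read the file line by line and hand each line to whatever JSON library is already on the classpath. A tab-separated dump would work too, as long as tabs and newlines inside titles and summaries are escaped first.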

Comments

Commons Feedparser never got out of the "Sandbox" or had any releases - so your only option is to check it out of subversion using the following link (the one Laird provided is just for browsing) and build it yourself: http://svn.apache.org/repos/asf/commons/dormant/feedparser/trunk/ AFAIK Commons Feedparser was the work of Kevin Burton, but he decided to go elsewhere to continue to develop the code - see: http://tailrank.com/code.php

In usual Apache fashion things are disorganized, cluttered and broken, but, in usual Apache fashion, you can bash your way around some of the scripts to get where you need to go. Here, for example, is the viewCVS link that works: http://svn.apache.org/viewvc/commons/dormant/feedparser/ Hope that helps you. Best, Laird (no affiliation with ASF)