The Need to Feed
I'm reading Programming Collective Intelligence by Toby Segaran. Excellent book. It is a very practical introduction to machine learning and data mining techniques. All the examples are in Python, which isn't a real problem. However, I am eager to try them out in Java, which means I am hunting for Java equivalents of the Python libraries used in the book.
Currently, I'm looking for a decent "universal" feed parser. I'm actually planning to apply this to a real project. Something forgiving of broken XML is important.
Here is what I've found so far:
I had so much hope for this library. It is SAX based and forgiving of broken XML. It depends on 6 other external libraries, which means I get to download and organize my jar files. This is a great way to feel productive when I want to avoid organizing my sock drawer. Except there is no download link, and the SVN link is broken as well. Actually, all sarcasm aside, I think this would be my first choice if I could get to it.
From the page: "Eddie is a liberal RSS and Atom feed parsing library for Java. It is a SAX based parser and as a result is capable of parsing a significant number of broken feeds. It was written after discovering that the well-known ROME feed parser is implemented using DOM and therefore incapable of dealing with ill formed XML. It also failed to parse some well formed feeds too." -- When I get a chance I plan to look into this one. It's my second choice for now.
Of course, there is the popular Rome feed parser. I like what I see on the java.net webpage. A minimal external library dependency!! The only thing I fear is broken feeds. I assume if it's that big of a deal they have probably done something about it by now. This one uses a DOM model and this article looks like a helpful start.
Fast forward to the future (now). I have done my best to play with these parsers. Here is what happened:
I snooped around the SVN directories and the site even more hoping to find some way to get the magic nectar that is this code. No dice and I was out of luck.
I tried to compile Eddie. I ended up downloading many of the libraries it needed and trying to compile them. Where possible, I tried downloading binaries of the libraries. I then became frustrated and downloaded a binary. I then realized the site had no javadoc on it and I'd have to generate my own. I know I don't need the libs to generate the javadoc, but by this point I was done with Eddie. (This post doesn't have a happy ending btw).
I did get Rome and the Rome Feed Fetcher Library working. Rome scores massive developer-friendly points for having a build file that worked right off the bat. Type 'ant' and go!!! For the Fetcher I had to create my own build file. Then I ran into lots of fun loading the whole thing into my scripting language (apparently Feed Fetcher crashes if it can't find a properties file in the system classpath). Overall I was happy with Rome and it worked pretty well, except that it lost information when parsing the (probably incorrectly formatted) feed for http://sleep.dashnine.org/.
What's a code hacker to do?
So I decided to go with a Python solution. I downloaded the Universal Feed Parser library and made a Python script to dump my downloaded feeds to a file so I could process them from my Java code. Everything worked without a hitch.
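For anyone who wants to try the same workaround, here is a rough sketch of what such a dump script could look like. It is not my original script. The tab-separated output format, the choice of fields, the feeds.tsv file name, and the helper names (clean, entry_to_line, dump_feed, main) are all assumptions made for illustration, and it is written for current Python 3. The only Universal Feed Parser pieces it relies on are feedparser.parse(), the bozo flag it sets on malformed feeds, and dictionary-style access to the parsed result.

```python
import sys


def clean(value):
    """Collapse whitespace so a field can never contain a tab or newline."""
    return " ".join(str(value or "").split())


def entry_to_line(feed_title, entry):
    """Format one entry as: feed title, entry title, link, summary.

    The field selection is an assumption; add or drop keys as needed.
    """
    fields = [
        feed_title,
        entry.get("title"),
        entry.get("link"),
        entry.get("summary"),
    ]
    return "\t".join(clean(field) for field in fields)


def dump_feed(parsed, out):
    """Write every entry of an already parsed feed to `out`.

    `parsed` only needs dictionary-style access, which is what
    feedparser.parse() returns. Returns the number of entries written.
    """
    feed_title = parsed.get("feed", {}).get("title")
    count = 0
    for entry in parsed.get("entries", []):
        out.write(entry_to_line(feed_title, entry) + "\n")
        count += 1
    return count


def main(paths, out_path="feeds.tsv"):
    """Parse each downloaded feed file and append its entries to out_path."""
    import feedparser  # third-party: pip install feedparser

    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            parsed = feedparser.parse(path)
            if parsed.bozo:
                # bozo is set when the feed was not well-formed; feedparser
                # still hands back whatever it managed to recover.
                print("warning: malformed feed:", path, file=sys.stderr)
            dump_feed(parsed, out)
```

Calling main() with a list of downloaded feed files writes one line per entry. On the Java side, each line can then be read and split on the tab character, which is the whole reason the sketch strips tabs and newlines out of every field first.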