Raphael Mudge's Blog

The Need to Feed

Posted by rsmudge on July 25, 2008 at 08:36 AM

I'm reading Collective Intelligence by Toby Segaran. Excellent book. It is a very practical introduction to machine learning and data mining techniques. All the examples are in Python, which isn't a real problem. However, I am eager to try them out in Java, which means I am hunting for Java equivalents of the Python libraries used in the book.

Currently, I'm looking for a decent "universal" feed parser. I'm actually planning to apply this to a real project, so something forgiving of broken XML is important. Here is what I've found so far:

I had so much hope for this library. It is SAX based and forgiving of broken XML. It depends on 6 other external libraries, which means I get to download and organize my jar files. This is a great way to feel productive when I want to avoid organizing my sock drawer. Except there is no download link, and the SVN link is broken as well. Actually, all sarcasm aside, I think this would be my first choice if I could get to it.

Eddie RSS and Atom Parser for Java

From the page: "Eddie is a liberal RSS and Atom feed parsing library for Java. It is a SAX based parser and as a result is capable of parsing a significant number of broken feeds. It was written after discovering that the well-known ROME feed parser is implemented using DOM and therefore incapable of dealing with ill formed XML. It also failed to parse some well formed feeds too."

When I get a chance I plan to look into this one. It's my second choice for now.

Of course there is the popular Rome feed parser. I like what I see on the java.net webpage. A minimal external library dependency!! The only thing I fear is broken feeds. I assume that if it's that big of a deal, they have probably done something about it by now. This one uses a DOM model, and this article looks like a helpful start.

Fast forward to the future (now). I have done my best to play with these parsers.
Here is what happened:

I snooped around the SVN directories and the site even more, hoping to find some way to get the magic nectar that is this code. No dice. I was out of luck.

Eddie RSS and Atom Parser for Java

I tried to compile this. I ended up downloading many of the libraries it needed and trying to compile them. Where possible, I tried downloading binaries of the libraries. I then became frustrated and downloaded a binary of Eddie itself. I then realized the site had no javadoc on it and I'd have to generate my own. I know I don't need the libs to generate the javadoc, but by this point I was done with Eddie. (This post doesn't have a happy ending btw.)

I did get Rome and the Rome Feed Fetcher Library working. Rome scores massive developer-friendly points for having a build file that worked right off the bat. Type 'ant' and go!!! For the Fetcher I had to create my own build file. Then I ran into trouble loading the whole thing into my scripting language (apparently Feed Fetcher crashes if it can't find a properties file in the system classpath). So that was lots of fun. Overall I was happy with Rome and it worked pretty well, except that it lost information when parsing the (probably incorrectly formatted) feed for http://sleep.dashnine.org/.

What's a code hacker to do? I decided to go with a Python solution. I downloaded the Universal Feed Parser library and made a Python script to dump my downloaded feeds to a file so I could process them from my Java code. Everything worked without a hitch.
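For anyone curious what such a dump script might look like, here is a minimal sketch. It is not the script from this post. It assumes the Universal Feed Parser is installed as the feedparser module, and the tab-separated output format, the choice of fields (feed title, entry title, link, summary), and the function names are all assumptions made for illustration.

```python
import sys


def clean(value):
    """Collapse tabs and newlines so each field stays on a single line."""
    return " ".join(str(value or "").split())


def entry_rows(parsed):
    """Yield (feed title, entry title, link, summary) for each entry.

    `parsed` is whatever feedparser.parse() returned; only dict-style
    access is used, so a plain dict of the same shape works as well.
    """
    feed_title = clean(parsed.get("feed", {}).get("title"))
    for entry in parsed.get("entries", []):
        yield (
            feed_title,
            clean(entry.get("title")),
            clean(entry.get("link")),
            clean(entry.get("summary")),
        )


def dump_feeds(paths, out):
    """Parse each downloaded feed file and write one tab-separated line per entry."""
    # Imported here so the helpers above work without the library installed.
    import feedparser

    for path in paths:
        for row in entry_rows(feedparser.parse(path)):
            out.write("\t".join(row) + "\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python dump_feeds.py feed1.xml feed2.xml > feeds.tsv
    dump_feeds(sys.argv[1:], sys.stdout)
```

On the Java side, each line of the output can then be split on the tab character, with no XML parsing needed at all.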