The Source for Java Technology Collaboration
User: Password:



Kohsuke Kawaguchi

Kohsuke Kawaguchi's Blog

Intelligent != Diligent

Posted by kohsuke on August 11, 2006 at 02:14 PM | Comments (4)

One of the things that really differenciate a good tool from a mediocre tool is the error handling. So in the JAXB RI, I spend a lot of efforts in making sure that the tools and the runtime detects errors, print them in a way that makes sense, and try to diagnose the problem better.

One of the typical human mistakes is a typo. After all, humans are intelligent but not supposed to be diligent, so there's no way we can remember long namespace URIs like http://java.sun.com/xml/ns/jaxb (was there a slash between xml and ns? Was there a trailing slash?) This happens to the JAXB RI users, too. When you type namespace URIs, you often make a typo. So when the JAXB RI finds some unknown type name inside @xsi:type, we first check if it's a typo of some known types.

This is where the notion of edit distance becomes useful. This pretty simple algorithm can compute how "close" one string is to another string, and works well for suggesting a fix for a typo. That is, if you have a list of "correct values", then find the one that has the smallest edit distance to the user given "wrong" value, and that one might be what the user have meant.

This algorithm is useful in many other contexts, when a program deals with human inputs. Just this morning, we were talking about using this algorithm in Glassfish asadmin command, which takes a sub-command name as an option (and there are more sub-commands than we can remember!)

This would be useful in the JAXP validator, too. Wouldn't it be nice if a validator could say "invalid element <naem>. Did you mean <name>?" (I did exactly this in my MSV validator, but the last time I checked the JAXP RI didn't do this.) Or how about your IDE telling you that not only is the class name MassageContext wrong, but it's actually likely a typo of MessageContext? And the list goes on.

If you are interested in using this in your project, Here is the working code under CDDL that I use in the JAXB RI.

P.S. the edit distance algorithm says "diligent" and "intelligent" are not related (distance 5.) Yes, it's that smart.


Bookmark blog post: del.icio.us del.icio.us Digg Digg DZone DZone Furl Furl Reddit Reddit
Comments
Comments are listed in date ascending order (oldest first)

  • Do you know what algorithm is used in Firefox 2.0 to propose alternative spellings to the words not found in its dictionary? The reason i'm not mentioning Babylon or Word is that they are close-sourced.

    Posted by: kirillcool on August 11, 2006 at 03:35 PM


  • Making tools handle malformed input is a worthy goal. I remember thinking it would be nice if JUnit could tolerate my typos and run the test whose name was closest to the garbled name I had entered (this was before JUnit was integrated in the IDE).

    Edit distance algorithms are great, but there are enhancements and alternatives which may provide even better corrections. Aspell has a neat algorithm that uses edit distance and phonetic matching (metaphone) to great effect - see this article I wrote a while back which discusses the ideas and a Java implementation (Jazzy).

    Another approach entirely is to use n-grams to do matching. There is a Lucene-based spell checking library that can do this for you - check out Did You Mean: Lucene? for details.

    Finally, there is an implementation of the edit distance algorithm (also called Levenshtein distance) in Apache Commons Lang.

    Tom

    Posted by: tomwhite on August 12, 2006 at 03:01 AM

  • Tom, thanks for the pointers. Very interesting read.

    Posted by: kohsuke on August 12, 2006 at 11:20 PM

  • i took a tour of levenshtein distance as well as the various encoders (metaphone, double metaphone etc) during an attempt to merge databases. It was not fun, but the learning aspect was :)
    Your article was excellent Tom, I read it a while ago when I was learning this stuff. I still hope to apply some things I learned during it.
    The encoders are available in Apache Commons Codec for those who want to play around with those.

    Posted by: ilazarte on August 14, 2006 at 07:37 AM





Powered by
Movable Type 3.01D
 Feed java.net RSS Feeds