Bring Your Code: An algorithm for index translation
Here is a little code challenge!
I'm currently working on a text-mining/semantic web application, developed in Java and focused (for the moment) on biomedical information. We use external tools for text-mining analysis, and unfortunately these tools don't handle HTML very well: if we send raw HTML to the text-mining service, it simply breaks. So we must convert the HTML to plain text before processing it, and because the tools report the words they identify by position, we must translate these positions (or indexes) back to find the corresponding words in the original HTML.
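To make the problem concrete, here is a minimal sketch of one possible approach. It is not the implementation posted on the gist. It assumes the conversion simply drops everything between '<' and '>', and that the HTML contains no entities (such as &amp;), comments, scripts, or '>' characters inside attribute values. The class name IndexTranslator and its methods are hypothetical names chosen for this illustration. While stripping the tags, it records the HTML position of every character it keeps, so that a plain-text index can be translated back with a single array lookup.

```java
import java.util.Arrays;

/**
 * Hypothetical helper: strips tags from an HTML string and remembers, for each
 * character kept in the plain text, where that character sat in the HTML.
 * Assumption: no entities, comments, scripts, or '>' inside attribute values.
 */
final class IndexTranslator {
    private final String plainText;
    // htmlOffsets[i] is the index in the HTML of the i-th plain-text character.
    private final int[] htmlOffsets;

    IndexTranslator(String html) {
        StringBuilder text = new StringBuilder();
        int[] offsets = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;               // start skipping a tag
            } else if (c == '>' && inTag) {
                inTag = false;              // the tag ends, resume copying
            } else if (!inTag) {
                offsets[text.length()] = i; // remember the original position
                text.append(c);
            }
        }
        this.plainText = text.toString();
        this.htmlOffsets = Arrays.copyOf(offsets, text.length());
    }

    /** The text that would be sent to the text-mining service. */
    String plainText() {
        return plainText;
    }

    /** Translates an index in the plain text to the matching index in the HTML. */
    int toHtmlIndex(int plainIndex) {
        return htmlOffsets[plainIndex];
    }
}
```

For example, with the HTML <p>The <b>BRCA1</b> gene</p>, the plain text is "The BRCA1 gene". A tool reporting the word BRCA1 at plain-text index 4 is translated by toHtmlIndex(4) to index 10 in the HTML. The lookup array costs one int per plain-text character, which is what makes each translation constant time. A real converter that also decodes entities or collapses whitespace would need a richer mapping than this.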
I created a simple implementation and posted it on gist.github.com. Can you make it better?
Here is the full blog entry: http://aloiscochard.blogspot.com/2010/06/bring-your-code-algorithm-for-index.html