Bring Your Code: An algorithm for index translation
Here is a little code challenge!
I'm currently working on a text-mining/semantic web application, developed in Java and focused (for the moment) on biomedical information. We use external tools for text-mining analysis, and unfortunately these tools don't handle HTML very well: if we send raw HTML to the text-mining service, it simply breaks. So we must convert the HTML to plain text before processing it, and because the tools report the words they identify by position, we must translate these positions (or indexes) back to find the corresponding words in the original HTML.
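To make the problem concrete, here is a minimal sketch of one way to do the translation: strip the tags and, for every character kept, remember where it came from in the HTML. This is an illustration only, not the implementation posted on the gist; the class `OffsetMappedText` and its methods are hypothetical names, and the tag detection is deliberately naive (every `<` is assumed to open a tag, and entities such as `&amp;`, comments and scripts are not handled).

```java
import java.util.Arrays;

// Hypothetical helper: plain text plus a map from each plain-text index
// back to the index of the same character in the original HTML.
final class OffsetMappedText {
    private final String plainText;
    private final int[] htmlOffsets; // htmlOffsets[i] = HTML index of plainText.charAt(i)

    private OffsetMappedText(String plainText, int[] htmlOffsets) {
        this.plainText = plainText;
        this.htmlOffsets = htmlOffsets;
    }

    // Naive tag stripping: everything between '<' and the next '>' is dropped.
    static OffsetMappedText fromHtml(String html) {
        StringBuilder text = new StringBuilder();
        int[] offsets = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                offsets[text.length()] = i; // record where this character came from
                text.append(c);
            }
        }
        return new OffsetMappedText(text.toString(), Arrays.copyOf(offsets, text.length()));
    }

    String plainText() {
        return plainText;
    }

    // Translate a position reported on the plain text into a position in the HTML.
    int toHtmlIndex(int plainIndex) {
        return htmlOffsets[plainIndex];
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        OffsetMappedText mapped = OffsetMappedText.fromHtml(html);
        int plainIndex = mapped.plainText().indexOf("world");
        int htmlIndex = mapped.toHtmlIndex(plainIndex);
        System.out.println(mapped.plainText());
        System.out.println(plainIndex + " -> " + htmlIndex);
    }
}
```

The map costs one `int` per plain-text character. That is simple, but for large documents a list of (plain offset, HTML offset) pairs recorded only at tag boundaries would be more compact.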
I created a simple implementation and posted it on gist.github.com. Can you make it better?
Here is the full blog entry: http://aloiscochard.blogspot.com/2010/06/bring-your-code-algorithm-for-index.html
- alois's blog

Comments
Great information
by karen61 - 2010-07-03 09:02
Keep up the great work!
I know you want code...
by aberrant - 2010-07-02 06:45
but can't you just replace all the HTML tag characters with spaces and then run it through your text-mining service?
Example:
HTML:
index of a = 11
Replace tags with spaces:
index of a = 11
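A minimal sketch of this suggestion, assuming every `<` opens a tag and every tag ends at the next `>`: each tag character is overwritten with a space, so the output has the same length as the input and every remaining character keeps its original index. The class `TagPadding` and the method `padTags` are hypothetical names, and the sample string is made up for illustration.

```java
// Hypothetical helper: blank out tags in place so that positions are preserved.
final class TagPadding {
    static String padTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            }
            // Inside a tag (including the '<' and '>' themselves) emit a space.
            out.append(inTag ? ' ' : c);
            if (c == '>') {
                inTag = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<b>a</b>";
        String padded = padTags(html);
        // The letter sits at the same index in both strings.
        System.out.println(html.indexOf('a') + " " + padded.indexOf('a'));
    }
}
```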
Thanks for your suggestion,
by alois - 2010-07-05 08:23
Thanks for your suggestion, but unfortunately it isn't a viable solution. Read my post here for detailed information, and tell me if you have any ideas; I would love to hear them:
http://forums.thedailywtf.com/forums/t/18230.aspx