Bring Your Code: An algorithm for index translation

Posted by alois on June 30, 2010 at 10:30 AM PDT

Here is a little code challenge!

I'm currently working on a text-mining/semantic web application, written in Java and focused (for the moment) on biomedical information. We use external tools for the text-mining analysis, and unfortunately these tools don't handle HTML very well: if we send raw HTML to the text-mining service, it simply breaks. So we must convert the HTML to plain text before processing it, and because the tools report the words they identify by position, we must translate these positions (or indexes) back to find the corresponding words in the original HTML.
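To make the problem concrete, here is a minimal sketch of one way to do the translation: strip the tags while recording, for every plain-text character, the index it had in the HTML. It is only an illustration under simplifying assumptions. A tag is taken to be anything between `<` and the next `>`, entities, comments and scripts are not handled, and the names `IndexTranslationSketch`, `Stripped`, `strip` and `htmlIndex` are made up for this example.

```java
import java.util.Arrays;

// Hypothetical names throughout; an illustrative sketch, not a reference implementation.
final class IndexTranslationSketch {

    /** Plain text plus, for each plain-text character, its index in the original HTML. */
    static final class Stripped {
        final String text;
        final int[] htmlIndex;

        Stripped(String text, int[] htmlIndex) {
            this.text = text;
            this.htmlIndex = htmlIndex;
        }
    }

    // Assumption: a tag is anything between '<' and the next '>'.
    // Entities, comments, scripts and '>' inside attribute values are not handled.
    static Stripped strip(String html) {
        StringBuilder text = new StringBuilder();
        int[] map = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;          // end of tag, resume copying
                }
            } else if (c == '<') {
                inTag = true;               // start of tag, stop copying
            } else {
                map[text.length()] = i;     // remember where this character came from
                text.append(c);
            }
        }
        return new Stripped(text.toString(), Arrays.copyOf(map, text.length()));
    }

    public static void main(String[] args) {
        String html = "This is <b>a test </b>.";
        Stripped s = strip(html);
        // Stand-in for a position reported by the text-mining tool.
        int plainPos = s.text.indexOf("a test");
        int htmlPos = s.htmlIndex[plainPos];
        System.out.println(s.text);                     // This is a test .
        System.out.println(plainPos + " -> " + htmlPos); // 8 -> 11
    }
}
```

The lookup array costs one `int` per plain-text character, which keeps the translation a constant-time array access; a list of (plain offset, HTML offset) pairs recorded only at tag boundaries would use less memory at the price of a binary search.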

I created a simple implementation and posted it on gist.github.com ... can you make it better?

Here is the full blog entry: http://aloiscochard.blogspot.com/2010/06/bring-your-code-algorithm-for-index.html

Comments

Great information

Keep up the great work!

I know you want code...

but can't you just replace all the HTML tag characters with spaces and then run it through your text-mining service?

Example:

HTML:

This is <b>a test </b>.

index of a = 11

Replace tags with spaces:

This is    a test     .

index of a = 11
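A minimal Java sketch of this idea, assuming a tag is simply everything from `<` to the next `>` inclusive (no handling of entities, comments or scripts). The names `TagBlankingSketch` and `blankTags` are made up for the example.

```java
// Hypothetical names; a sketch of the "replace tags with spaces" suggestion.
final class TagBlankingSketch {

    // Assumption: a tag is everything from '<' to the next '>' inclusive.
    // Every tag character becomes a space, so the output has the same length
    // as the input and every remaining character keeps its original index.
    static String blankTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            }
            out.append(inTag ? ' ' : c);
            if (c == '>') {
                inTag = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "This is <b>a test </b>.";
        String blanked = blankTags(html);
        System.out.println(blanked.length() == html.length()); // true
        System.out.println(blanked.indexOf("a test"));         // 11
    }
}
```

Because the lengths match, no index translation is needed afterwards. The trade-off is that the text-mining service sees runs of spaces wherever the markup used to be.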

 Thanks for your suggestion,

Thanks for your suggestion, but unfortunately that isn't a workable solution. Read my post here for the details, and tell me if you have any ideas; I would love to hear them:

http://forums.thedailywtf.com/forums/t/18230.aspx