Text Normalization...what's that?
Posted by joconner on August 23, 2006 at 11:52 AM PDT
You remember that I once said that String's
In case you overlooked it, that second example shows how to spell the name using e + ` instead of the precomposed è. Visually and linguistically, of course, these strings are the same. Any decent graphical system presents them the same way visually. And we don't really care how such a name is spelled in a database, or perhaps we shouldn't. However, if you use String's

You should normalize the text, of course. To normalize means that you put the text in a common form. For example, you might decide that your application just doesn't want the hassle of combining sequences, so you might normalize all text so that all accented characters have a precomposed representation in your data storage. So, even though a user may enter M i c h e ` l e, your app converts it to M i c h è l e right away. The normalized form would be a precomposed form for all accented characters.

Of course, you might decide on the other form: combining sequences. So if the user enters M i c h è l e, your app decomposes the string to the e ` form. Either way, the process of converting text to a common form is called normalization, and the API hasn't been available until Java SE 6...shh, we can't say "Mustang" anymore. The

Several normalization forms exist, represented by the enum
I'll revisit each of these in the next few days. I think you'll find this new class helpful in dealing with the many ways that people create the same text.
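To make the precomposed versus combining-sequence idea concrete, here is a minimal sketch. It assumes the Java SE 6 API in question is java.text.Normalizer with its Normalizer.Form enum, and it uses only the two canonical forms, NFC (precomposed) and NFD (combining sequences). The class name NormalizationSketch and the helper names toComposed and toDecomposed are made up for illustration.

```java
import java.text.Normalizer;

public class NormalizationSketch {

    // Hypothetical helper: convert text to the precomposed form (NFC),
    // so e + U+0300 becomes the single code point U+00E8.
    public static String toComposed(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Hypothetical helper: convert text to the combining-sequence form (NFD),
    // so U+00E8 becomes e followed by U+0300.
    public static String toDecomposed(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "Mich\u00E8le";  // accented e as one code point
        String combining = "Miche\u0300le";   // plain e plus COMBINING GRAVE ACCENT

        // The raw strings hold different chars, so a plain comparison fails.
        System.out.println(precomposed.equals(combining));  // false

        // Once both sides are in the same form, they compare equal.
        System.out.println(toComposed(precomposed).equals(toComposed(combining)));      // true
        System.out.println(toDecomposed(precomposed).equals(toDecomposed(combining)));  // true
    }
}
```

Which form you pick matters less than picking one and applying it consistently, for example at the point where user input enters your app and before it reaches storage or comparison.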