
Text Normalization...what's that?

Posted by joconner on August 23, 2006 at 11:52 AM PDT

You may remember that I once said that String's equals method just isn't enough sometimes. The reason is that the equals method doesn't understand combining characters. Like it or not, the Unicode standard allows us to create equivalent text in multiple ways. Here's an example of how you can spell Michèle two ways:

  1. Precomposed characters: M i c h è l e
  2. Combining sequences: M i c h e ` l e

In case you overlooked it, that second example shows how to spell the name using e + ` instead of the precomposed è. Visually and linguistically, of course, these strings are the same. Any decent graphical system presents them the same way visually. And we don't really care how such a name is spelled in a database, or perhaps we shouldn't. However, if you use String's equals method to determine whether these two are the same, well, now we have a problem. The equals method compares the strings one char unit at a time, and it doesn't realize that e + ` (two char units) is linguistically equivalent to the single character è. So, what's a savvy programmer supposed to do for this comparison?
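
To make the problem concrete, here is a minimal sketch of the mismatch. It assumes the backtick above stands in for U+0300 (COMBINING GRAVE ACCENT); the class name EqualsDemo and its constants are made up for illustration.

```java
class EqualsDemo {
    // U+00E8 is the precomposed e-with-grave character.
    static final String PRECOMPOSED = "Mich\u00E8le";
    // Plain e followed by U+0300, the combining grave accent
    // (an assumption about what the backtick in the post stands for).
    static final String COMBINING = "Miche\u0300le";

    public static void main(String[] args) {
        System.out.println(PRECOMPOSED.length());          // prints 7
        System.out.println(COMBINING.length());            // prints 8
        // equals compares char units one by one, so the two spellings differ.
        System.out.println(PRECOMPOSED.equals(COMBINING)); // prints false
    }
}
```

The two strings don't even have the same length, so a char-by-char comparison has no chance of treating them as equal.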

You should normalize the text, of course. To normalize means that you put the text in a common form. For example, you might decide that your application just doesn't want the hassle of combining sequences, so you might normalize all text so that all accented characters have a precomposed representation in your data storage. So, even though a user may enter M i c h e ` l e, your app converts it to M i c h è l e right away. The normalized form would be a precomposed form for all accented characters. Of course, you might decide on the other form: combining sequences. So if the user enters M i c h è l e, your app decomposes the string to the e ` form. Either way, the process of converting text to a common form is called normalization, and the API hasn't been available until Java SE 6...shh, we can't say "Mustang" anymore.
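
As a rough sketch of the two choices described above, the following uses Normalizer.normalize with the NFC form to compose and the NFD form to decompose. The helper names compose and decompose are hypothetical, and U+0300 is again assumed as the combining accent.

```java
import java.text.Normalizer;

class NormalizeForms {
    // Convert to the precomposed representation (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Convert to base characters followed by combining marks (NFD).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String typed = "Miche\u0300le";   // user entered e + combining grave
        String stored = compose(typed);   // what a "precomposed" app would keep
        System.out.println(stored.equals("Mich\u00E8le"));            // prints true
        System.out.println(decompose(stored).equals("Miche\u0300le")); // prints true
    }
}
```

Which form you pick matters less than picking one and applying it consistently wherever text enters your application.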

The Normalizer class was hidden away in a non-public package until now. It has been doing its job for a long time, primarily for the Collator class. However, now the API is public as java.text.Normalizer. You can use this class to normalize text for more complex comparisons. Once you've normalized text, you can fall back to String's equals method to help you compare the text. Wait...I said earlier that this method didn't work. Now it does? Well, yes, but only when both strings have been normalized to the same form. And now you can do that with the newly available java.text.Normalizer API.
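
One way to put that into practice is a small comparison helper, sketched below. The class and method name (NormalizedCompare.sameText) are invented for this example, and choosing NFC is just one reasonable option; any form works as long as both sides use the same one.

```java
import java.text.Normalizer;

class NormalizedCompare {
    // Normalize both arguments to the same form, then let equals do the rest.
    static boolean sameText(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        // Precomposed vs. combining spelling of the same name.
        System.out.println(sameText("Mich\u00E8le", "Miche\u0300le")); // prints true
        // An unaccented e is a genuinely different string.
        System.out.println(sameText("Michele", "Mich\u00E8le"));       // prints false
    }
}
```

Note that normalization only removes differences in representation. It does not strip accents, so Michele and Michèle still compare as different.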

Several normalization forms exist, represented by the enum Normalizer.Form:

  • NFC -- canonical decomposition, followed by canonical composition
  • NFD -- canonical decomposition
  • NFKC -- compatibility decomposition, followed by canonical composition
  • NFKD -- compatibility decomposition
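
To get a feel for how the four forms differ, here is a small sketch that runs each one over the same two-character string: the fi ligature (U+FB01) followed by the precomposed è (U+00E8). The sample string and the hex helper are assumptions chosen for illustration, not anything from the API itself.

```java
import java.text.Normalizer;

class FormsDemo {
    // Render each char unit as U+XXXX so invisible differences show up.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String sample = "\uFB01\u00E8"; // fi ligature + precomposed e-grave

        // Canonical forms leave the ligature alone.
        System.out.println(hex(Normalizer.normalize(sample, Normalizer.Form.NFC)));
        // prints U+FB01 U+00E8
        System.out.println(hex(Normalizer.normalize(sample, Normalizer.Form.NFD)));
        // prints U+FB01 U+0065 U+0300

        // Compatibility forms also expand the ligature into f + i.
        System.out.println(hex(Normalizer.normalize(sample, Normalizer.Form.NFKC)));
        // prints U+0066 U+0069 U+00E8
        System.out.println(hex(Normalizer.normalize(sample, Normalizer.Form.NFKD)));
        // prints U+0066 U+0069 U+0065 U+0300
    }
}
```

The canonical forms only change how accents are represented, while the compatibility forms also replace characters like ligatures with their plain equivalents, which loses some formatting information.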

I'll revisit each of these in the next few days. I think you'll find this new class helpful in dealing with the many ways that people create the same text.
