Normalization: Canonical Decomposition
Posted by joconner on February 08, 2007 at 01:50 AM
You'll recall from a previous blog post that normalization is the process of transforming text into a standard form that facilitates reliable searching, sorting, and other text operations. Java SE 6 provides a new normalization API that implements the Unicode standard for normalization: java.text.Normalizer.
There are 4 normalization forms: NFD, NFC, NFKD, NFKC. Normalization Form D (NFD) is canonical decomposition, and that's probably the best place to start a description.
Unicode often provides multiple ways to create a character. For example, you can create an e acute in two ways. You might enter a precomposed character é (LATIN SMALL LETTER E WITH ACUTE). Alternatively, you can enter a combining sequence of two characters e+´ (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT). In your character storage, the precomposed character is the value \u00E9. The combining sequence has two values, \u0065 and \u0301. When a graphical rendering engine shows you the characters, no visual difference should exist. The precomposed character and the combining sequence mean the same thing, and you should see the same glyph on the screen: é.
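To make the two storage forms concrete, here's a minimal sketch that prints the char values behind each one. The class name EAcuteForms and the hex helper are hypothetical names I made up for illustration; they are not part of any Java API.

```java
final class EAcuteForms {
    /** Formats each char of the text as a U+XXXX value, separated by spaces. */
    static String hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String combining = "e\u0301";  // e followed by COMBINING ACUTE ACCENT

        System.out.println(hex(precomposed)); // U+00E9
        System.out.println(hex(combining));   // U+0065 U+0301
    }
}
```

Both strings should render as the same glyph, yet one holds a single char value and the other holds two.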
So, if the two forms are semantically and visually the same, what's the big deal? Why am I bringing this up? I'm bringing it up because applications, tools, and databases sometimes store character data in different forms. You can store the name "José" with either precomposed text or combining sequences. If you search for "José" in your corporate database, you probably don't care whether the db stores the name as J+o+s+é or J+o+s+e+´. However, your search algorithm might! If the db has the combining sequence in its storage, but you search for the precomposed form, your software may never find "José". Your software must compare text in the same form (the same normalization form) to properly find the match. NFD is one such normalization form.
Let's belabor the point even further. Assume for a moment that you have two String objects:
String name1 = "Jos\u00E9"; // José with precomposed é
String name2 = "Jos\u0065\u0301"; // José with combining sequence e + ´
If you view these strings in a list or other Swing component, the characters have the same visual shape.
However, if you compare them with the String class equals method, you'll see that they are different. Can you imagine the headache you'd have trying to figure out why name1.equals(name2) is false with these strings? The fact that the two strings look the same visually would be especially misleading. Oh sure, you know they're different now. You have both name1 and name2 right above you here, and you can see that the char values are different. However, you wouldn't have that advantage if you were searching for José in the corporate LDAP directory. I'm not saying your LDAP directory stores information in NFD, but I am saying that the mixed comparison between precomposed and combining sequence text forms is not going to give you good results when using the String equals method. You should use normalized text for many kinds of comparisons, especially when your text comes from two different sources.
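Here's a small, runnable sketch of that mismatch. The class name MixedFormComparison, its constants, and the "Garcia" record text are hypothetical and exist only for this illustration; the sketch uses nothing beyond plain String methods.

```java
final class MixedFormComparison {
    // The same name stored two ways.
    static final String PRECOMPOSED = "Jos\u00E9"; // 4 char values
    static final String COMBINING = "Jose\u0301";  // 5 char values

    public static void main(String[] args) {
        // equals compares char values, not what you see on screen.
        System.out.println(PRECOMPOSED.equals(COMBINING)); // false
        System.out.println(PRECOMPOSED.length());          // 4
        System.out.println(COMBINING.length());            // 5

        // A naive substring search misses the name for the same reason.
        String record = COMBINING + " Garcia";
        System.out.println(record.contains(PRECOMPOSED));  // false
    }
}
```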
When you normalize text to NFD, you decompose all characters that have equivalent combining sequences. A single, visual character might decompose into 2 or more character values in a combining sequence. The character é (\u00E9) becomes the sequence e+´ (\u0065\u0301).
Fortunately, you don't have to remember all the decompositions in Unicode. The java.text.Normalizer class, now available in Java SE 6, can normalize text for you. Here's an example using canonical decomposition (NFD):
import java.text.Normalizer;

String name1 = "Jos\u00E9"; // José with precomposed é
String name2 = Normalizer.normalize(name1, Normalizer.Form.NFD); // canonical decomposition of name1
The name2 contents are now these values: Jos\u0065\u0301
Now, if your db or other source of text contains NFD text, you're ready for the comparisons.
You have to compare apples and apples, so to speak. So, you have to normalize the text first. Depending on your data sources, you may have to normalize text from one or both sources for a correct comparison.
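One way to do that, sketched under the assumption that you can't be sure which form either source uses, is to normalize both sides to NFD before calling equals. The class NfdCompare and the method equalsNfd are hypothetical names for this example; Normalizer.normalize and Normalizer.Form.NFD are the Java SE 6 API described above.

```java
import java.text.Normalizer;

final class NfdCompare {
    /** Returns true when both strings are identical after canonical decomposition (NFD). */
    static boolean equalsNfd(String a, String b) {
        String normalizedA = Normalizer.normalize(a, Normalizer.Form.NFD);
        String normalizedB = Normalizer.normalize(b, Normalizer.Form.NFD);
        return normalizedA.equals(normalizedB);
    }

    public static void main(String[] args) {
        String fromForm = "Jos\u00E9";      // precomposed, e.g. typed by a user
        String fromStorage = "Jose\u0301";  // combining sequence, e.g. read from storage

        System.out.println(fromForm.equals(fromStorage));    // false
        System.out.println(equalsNfd(fromForm, fromStorage)); // true
    }
}
```

If you already know one source is NFD, you can skip normalizing that side and save the work.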
Of course, the Collator class hides all these normalization details. It uses Normalizer underneath. However, many developers like the freedom of having direct access to the Normalizer functionality.
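If you'd rather let Collator do the work, a sketch might look like the following. Setting the decomposition mode explicitly is a conservative choice on my part, since the Collator Javadoc lists NO_DECOMPOSITION as the default setting. The class name CollatorCompare, the helper decomposingCollator, and the choice of Locale.US are assumptions for illustration only.

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorCompare {
    /** Builds a collator that canonically decomposes text before comparing it. */
    static Collator decomposingCollator() {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    public static void main(String[] args) {
        String name1 = "Jos\u00E9";   // precomposed
        String name2 = "Jose\u0301";  // combining sequence

        Collator collator = decomposingCollator();
        // compare returns 0 when the collator treats the strings as equal.
        System.out.println(collator.compare(name1, name2) == 0);
    }
}
```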
OK, I've driven this point right into the ground. In summary, NFD is a normalization form. There are 3 other forms. You should know what form your text data is in...it makes comparisons, searches, and other text processing more reliable. You now have direct access to a normalization class that was previously hidden. Use the Normalizer class to prepare text data for storage, transport, or comparisons.
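If you aren't sure what form your text is in, the same class can tell you. This is a minimal sketch using Normalizer.isNormalized, which is part of java.text.Normalizer; the class name FormCheck and the helper isNfd are hypothetical.

```java
import java.text.Normalizer;

final class FormCheck {
    /** Returns true when the text is already in canonical decomposed form (NFD). */
    static boolean isNfd(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        System.out.println(isNfd("Jos\u00E9"));  // false: precomposed e acute can still decompose
        System.out.println(isNfd("Jose\u0301")); // true: already a combining sequence
    }
}
```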
Comments
-
Great description.
Thanks!
Posted by: svalencia on February 08, 2007 at 12:50 PM
-
very useful feature! didn't know about it yet, so good thing that you blogged about it..
it may save me some headaches sometime :-)
Posted by: zero on February 08, 2007 at 01:15 PM
-
this little gem is one of my favourite features in Java 6! I'm using it already. Extra i18n support is always much appreciated.
Posted by: benloud on February 09, 2007 at 06:19 AM
-
Thanks, John, for a great explanation of a concept that's too-often overlooked.
As an aside, since you or viewers of your blog are more likely to be up to speed in this area, are there any good facilities for compression of collation keys? I've tried to read the Unicode notes on this, but never gotten very far with my limited expertise.
Posted by: erickson on February 09, 2007 at 09:37 AM
-
Thanks.
Posted by: steevcoco on February 09, 2007 at 01:09 PM
-
John, I would like to know, will improvements to the ICU get merged with the JDK? I find generating a CollationKey with the JDK to be as much as 6x slower than the ICU. The JDK's Normalizer also seems to be around 10-20% slower. I'd just like to know what the current relationship is between the JDK and the ICU.
Posted by: benloud on February 09, 2007 at 06:36 PM
-
@benloud, unfortunately I'm not part of that i18n team, so I don't know any specific plans. However, the push to use the CLDR is unmistakable. I think that the CLDR, which came out of the ICU project, will probably find its way into the JDK in a future release. I'm less certain about the ICU classes themselves.
Posted by: joconner on February 20, 2007 at 10:21 AM
-
Fab! Now I have one less excuse to postpone upgrading my indexing component from ISO8859-1 to Unicode!
Posted by: damonhd on March 01, 2007 at 05:31 AM
-
$NFD_string = NFD($string)
returns the Normalization Form D (formed by canonical decomposition).
$NFC_string = NFC($string)
returns the Normalization Form C (formed by canonical decomposition followed by canonical composition).
$NFKD_string = NFKD($string)
returns the Normalization Form KD (formed by compatibility decomposition).
$NFKC_string = NFKC($string)
returns the Normalization Form KC (formed by compatibility decomposition followed by canonical composition).
$normalized_string = normalize($form_name, $string)
If you have any other examples, feel free to post them on my blog.
And I almost forgot: does anyone have any kind of API or example for Quick Check?
Posted by: angelas on May 11, 2007 at 02:54 PM
-
Finally!!!!! :)
Posted by: marma on May 22, 2007 at 01:24 PM