
Normalization: Canonical Decomposition

Posted by joconner on February 8, 2007 at 1:50 AM PST

You'll recall from a previous blog post that normalization is the process of transforming text into a standard form that facilitates reliable searching, sorting, and other text operations. Java SE 6 provides a new normalization API that implements the Unicode standard for normalization: java.text.Normalizer.

There are 4 normalization forms: NFD, NFC, NFKD, NFKC. Normalization Form D (NFD) is canonical decomposition, and that's probably the best place to start a description.

Unicode often provides multiple ways to create a character. For example, you can create an e acute in two ways. You might enter a precomposed character é (LATIN SMALL LETTER E WITH ACUTE). Alternatively, you can enter a combining sequence of two characters e+´ (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT). In your character storage, the precomposed character is the value \u00E9. The combining sequence has two values, \u0065 and \u0301. When a graphical rendering engine shows you the characters, no visual difference should exist. The precomposed character and the combining sequence mean the same thing, and you should see the same glyph on the screen: é.
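To make the difference in storage concrete, here is a minimal sketch that prints the char values behind each form. The class name EAcuteForms and the helper escape are hypothetical names made up for this illustration; they are not part of any Java API.

```java
class EAcuteForms {
    // Renders every UTF-16 char of s as a backslash-u escape with four hex digits.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";     // LATIN SMALL LETTER E WITH ACUTE
        String combining = "\u0065\u0301"; // e followed by COMBINING ACUTE ACCENT

        // One stored char for the precomposed form, two for the combining sequence.
        System.out.println(precomposed.length() + " char:  " + escape(precomposed));
        System.out.println(combining.length() + " chars: " + escape(combining));
    }
}
```

Both strings should render as the same glyph, yet the first holds one char value and the second holds two.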

So, if the two forms are semantically and visually the same, what's the big deal? Why am I bringing this up? I'm bringing it up because applications, tools, and databases sometimes store character data in different forms. You can store the name "José" with either precomposed text or combining sequences. If you search for "José" in your corporate database, you probably don't care whether the db stores the name as J+o+s+é or J+o+s+e+´. However, your search algorithm might! If the db has the combining sequence in its storage, but you search for the precomposed form, your software may never find "José". Your software must compare text in the same form (the same normalization form) to properly find the match. NFD is one such normalization form.

Let's belabor the point even further. Assume for a moment that you have two String objects:

String name1 = "Jos\u00E9";       // José with precomposed é
String name2 = "Jos\u0065\u0301"; // José with combining sequence e + ´

If you view these strings in a list or other Swing component, the characters have the same visual shape:

[Figure: SwingNFD.png, showing both strings rendered with identical glyphs in a Swing list]

However, if you compare them with the String class equals method, you'll see that they are different. Can you imagine the headache you'd have trying to figure out why name1.equals(name2) is false with these strings? The fact that the two strings look the same visually would be especially misleading. Oh sure, you know they're different now. You have both name1 and name2 right above you here, and you can see that the char values are different. However, you wouldn't have that advantage if you were searching for José in the corporate LDAP directory. I'm not saying your LDAP directory stores information in NFD, but I am saying that the mixed comparison between precomposed and combining sequence text forms is not going to give you good results when using the String equals method. You should use normalized text for many kinds of comparisons, especially when your text comes from two different sources.
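If you ever do hit this headache, a small diagnostic can help. The following sketch reproduces the failed comparison and reports where the two strings first diverge. The class EqualsDemo and the method firstDifference are hypothetical helpers written for this post, not library calls.

```java
class EqualsDemo {
    // Returns the index of the first differing char, or -1 if the strings are identical.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // Same prefix: identical only if the lengths match too.
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        String name1 = "Jos\u00E9";       // precomposed
        String name2 = "Jos\u0065\u0301"; // combining sequence

        System.out.println(name1.equals(name2));            // false
        System.out.println(name1.length());                 // 4
        System.out.println(name2.length());                 // 5
        System.out.println(firstDifference(name1, name2));  // 3
    }
}
```

The strings match through "Jos" and then split at index 3, where one holds the precomposed character and the other holds a plain e.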

When you normalize text to NFD, you decompose all characters that have equivalent combining sequences. A single, visual character might decompose into 2 or more character values in a combining sequence. The character é (\u00E9) becomes the sequence e+´ (\u0065\u0301).

Fortunately, you don't have to remember all the decompositions in Unicode. The java.text.Normalizer class, now available in Java SE 6, can normalize text for you. Here's an example using canonical decomposition (NFD):

// Requires: import java.text.Normalizer;
String name1 = "Jos\u00E9";       // José with precomposed é
String name2 = Normalizer.normalize(name1, Normalizer.Form.NFD);

The name2 contents are now these values: Jos\u0065\u0301
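You can check that result yourself. The sketch below also decomposes a character that expands into three values: ệ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW) becomes e plus two combining marks. The class NfdDemo and its nfd helper are hypothetical names used only for this illustration; Normalizer.normalize is the real API call.

```java
import java.text.Normalizer;

class NfdDemo {
    // Thin wrapper around the Java SE 6 API for canonical decomposition.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String name2 = nfd("Jos\u00E9");
        System.out.println(name2.equals("Jos\u0065\u0301")); // true
        System.out.println(name2.length());                  // 5

        // One precomposed char becomes e + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT.
        String threeParts = nfd("\u1EC7");
        System.out.println(threeParts.length());             // 3
    }
}
```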

Now, if your db or other source of text contains NFD text, you're ready for the comparisons.

You have to compare apples to apples, so to speak, which means normalizing the text first. Depending on your data sources, you may have to normalize text from one or both sources for a correct comparison.
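When you don't know which form either source uses, one simple approach is to normalize both sides before comparing. Here is a minimal sketch; NormalizedCompare and equalsNfd are hypothetical names, and the sketch assumes you can afford to normalize on every comparison rather than once at storage time.

```java
import java.text.Normalizer;

class NormalizedCompare {
    // Normalizes both arguments to NFD, then compares the resulting char sequences.
    static boolean equalsNfd(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String fromUser = "Jos\u00E9";          // precomposed, e.g. typed into a search box
        String fromStorage = "Jos\u0065\u0301"; // combining sequence, e.g. read from a db

        System.out.println(fromUser.equals(fromStorage));    // false
        System.out.println(equalsNfd(fromUser, fromStorage)); // true
    }
}
```

If one source is already known to hold NFD text, you only need to normalize the other side.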

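Of course, the Collator class can hide these normalization details. When you set its decomposition mode to Collator.CANONICAL_DECOMPOSITION, it decomposes text internally before comparing. Keep in mind that the documented default mode is NO_DECOMPOSITION, so you have to ask for this behavior. However, many developers like the freedom of having direct access to the Normalizer functionality.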
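For comparison, here is a sketch of the Collator route with canonical decomposition enabled explicitly. CollatorDemo and collatorEquals are hypothetical names for this post, and the choice of Locale.ENGLISH is only an assumption for the example; pick the locale that matches your data.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorDemo {
    // Compares two strings with a Collator that decomposes canonical variants first.
    static boolean collatorEquals(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // Canonical decomposition corresponds to NFD.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Precomposed and combining forms of the same name compare as equal.
        System.out.println(collatorEquals("Jos\u00E9", "Jos\u0065\u0301")); // true
        // An unaccented e is still a different letter at the default strength.
        System.out.println(collatorEquals("Jose", "Jos\u00E9"));            // false
    }
}
```

A Collator does locale-sensitive ordering on top of this, so it is the heavier tool. If all you need is an exact match, normalizing and calling equals is usually enough.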

OK, I've driven this point right into the ground. In summary, NFD is one of four normalization forms. You should know what form your text data is in, because that makes comparisons, searches, and other text processing more reliable. You now have direct access to a normalization class that was previously hidden. Use the Normalizer class to prepare text data for storage, transport, or comparisons.
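If you aren't sure what form a piece of text is in, Normalizer can tell you. The sketch below uses Normalizer.isNormalized to check for NFD and normalizes only when needed. The class FormCheck and its methods isNfd and prepareForStorage are hypothetical names, and the choice of NFD as the storage form is an assumption for the example.

```java
import java.text.Normalizer;

class FormCheck {
    // True if the text is already in canonical decomposed form.
    static boolean isNfd(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFD);
    }

    // Returns the text unchanged when it is already NFD, otherwise a normalized copy.
    static String prepareForStorage(String s) {
        return isNfd(s) ? s : Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        System.out.println(isNfd("Jos\u00E9"));                      // false
        System.out.println(isNfd("Jos\u0065\u0301"));                // true
        System.out.println(prepareForStorage("Jos\u00E9").length()); // 5
    }
}
```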
