Normalization: Canonical Decomposition
Posted by joconner on February 8, 2007 at 1:50 AM PST
You'll recall from a previous blog that normalization is the process of transforming text into a standard form that facilitates reliable searching, sorting, and other text operations. Java SE 6 provides a new normalization API that implements the Unicode standard for normalization. There are 4 normalization forms: NFD, NFC, NFKD, and NFKC.

Normalization Form D (NFD) is canonical decomposition, and that's probably the best place to start a description. Unicode often provides multiple ways to represent a character. For example, you can create an e acute in two ways. You might enter a precomposed character é (LATIN SMALL LETTER E WITH ACUTE). Alternatively, you can enter a combining sequence of two characters e+´ (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT). In your character storage, the precomposed character is the single value \u00E9. The combining sequence has two values, \u0065 and \u0301.

When a graphical rendering engine shows you the characters, no visual difference should exist. The precomposed character and the combining sequence mean the same thing, and you should see the same glyph on the screen: é.

So, if the two forms are semantically and visually the same, what's the big deal? Why am I bringing this up? I'm bringing it up because applications, tools, and databases sometimes store character data in different forms. You can store the name "José" with either precomposed text or combining sequences. If you search for "José" in your corporate database, you probably don't care whether the db stores the name as [...]

Let's belabor the point even further. Assume for a moment that you have two [...]

    String name1 = "Jos\u00E9";       // José with precomposed é
    String name2 = "Jos\u0065\u0301"; // José with combining sequence e + ´

If you view these strings in a list or other Swing component, the characters have the same visual shape.
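To make the storage difference concrete, here is a minimal sketch that lists the code points held by each string. The class name `CodePointDump` and the helper `describe` are hypothetical names used only for this illustration; the only library calls are standard `String` and `Character` methods.

```java
// Sketch: list the code points stored in each form of the name from the text.
// CodePointDump and describe are hypothetical names used only for illustration.
class CodePointDump {

    // Returns the code points of s in U+XXXX notation, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance past surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name1 = "Jos\u00E9";       // precomposed e-acute
        String name2 = "Jos\u0065\u0301"; // e followed by combining acute accent

        System.out.println(name1.length() + ": " + describe(name1));
        // prints "4: U+004A U+006F U+0073 U+00E9"
        System.out.println(name2.length() + ": " + describe(name2));
        // prints "5: U+004A U+006F U+0073 U+0065 U+0301"
    }
}
```

The two strings look alike on screen, but one holds four values and the other five.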
However, if you compare them with the [...]

When you normalize text to NFD, you decompose all characters that have equivalent combining sequences. A single visual character might decompose into 2 or more character values in a combining sequence. The character é (\u00E9) becomes the sequence e+´ (\u0065\u0301). Fortunately, you don't have to remember all the decompositions in Unicode. The [...]

    // Normalizer here is java.text.Normalizer
    String name1 = "Jos\u00E9"; // José with precomposed é
    String name2 = Normalizer.normalize(name1, Normalizer.Form.NFD);

The [...]

Now, if your db or other source of text contains NFD text, you're ready for the comparisons. You have to compare apples with apples, so to speak, which means you have to normalize the text first. Depending on your data sources, you may have to normalize text from one or both sources for a correct comparison. Of course, the [...]

OK, I've driven this point right into the ground. In summary, NFD is a normalization form. There are 3 other forms. You should know what form your text data is in, because that makes comparisons, searches, and other text processing more reliable. You now have direct access to a normalization class that was previously hidden. Use the [...]
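As a rough sketch of the "apples and apples" comparison described above, the following normalizes both inputs to NFD before comparing them. The class `NfdCompare` and its helpers `toNfd` and `equalsAfterNfd` are hypothetical names for illustration. `Normalizer.normalize` and `Normalizer.isNormalized` are real `java.text.Normalizer` methods, although using `isNormalized` here is my own addition rather than something taken from the text.

```java
import java.text.Normalizer;

// Sketch: compare two strings only after bringing both to NFD.
// NfdCompare, toNfd, and equalsAfterNfd are hypothetical names.
class NfdCompare {

    // Canonical decomposition of s (Normalization Form D).
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // True when a and b are identical once both are decomposed.
    static boolean equalsAfterNfd(String a, String b) {
        return toNfd(a).equals(toNfd(b));
    }

    public static void main(String[] args) {
        String name1 = "Jos\u00E9";       // precomposed e-acute
        String name2 = "Jos\u0065\u0301"; // e followed by combining acute accent

        System.out.println(name1.equals(name2));          // prints "false"
        System.out.println(equalsAfterNfd(name1, name2)); // prints "true"

        // isNormalized reports whether a string is already in the given form.
        System.out.println(Normalizer.isNormalized(name1, Normalizer.Form.NFD)); // prints "false"
        System.out.println(Normalizer.isNormalized(name2, Normalizer.Form.NFD)); // prints "true"
    }
}
```

Normalizing both sides costs a little extra work when one source is already in NFD, but it avoids having to trust assumptions about where each string came from.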