Normalization: Canonical Composition
Posted by joconner on February 11, 2007 at 3:45 PM PST
Continuing the discussion about Unicode normalization, I'll briefly describe Normalization Form C (NFC). NFC is canonical decomposition followed by canonical composition. It's the form you see most often on the web. In fact, NFC is the preferred normalization form for the World Wide Web. Why? Well, the form is slightly more compact than a decomposed form containing combining sequences, and most people simply feel more comfortable with the 1:1 matching between Unicode code points and glyphs instead of the *:1 ratio that decomposed forms produce.

NFC tries to find a single precomposed character for each combining sequence in a string. The mappings from combining sequences to precomposed forms are defined by the Unicode standard. Some combining sequences have no explicit mapping, but the final normalized string will contain single-character code points wherever possible.

Java SE 6 provides the java.text.Normalizer class for this:

    import java.text.Normalizer;

    // "Jos" followed by 'e' (U+0065) and COMBINING ACUTE ACCENT (U+0301)
    String anotherName = "Jos\u0065\u0301";
    String nfcName = Normalizer.normalize(anotherName, Normalizer.Form.NFC);

After this call, nfcName holds "Jos\u00e9": the e and its combining accent have been replaced by the single precomposed character U+00E9.

In the next couple of weeks, Sun will publish another Core Java Technology Tech Tip. The next one contains a great tip from Sergey Groznyh, a JDK programmer at Sun. He describes how to use the »
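To make the NFC behavior described above concrete, here is a small sketch that composes the decomposed name from the post and also tries a sequence for which Unicode defines no precomposed character. The class name NfcDemo and the helper toNfc are hypothetical names chosen for illustration; only java.text.Normalizer comes from Java SE 6 itself.

```java
import java.text.Normalizer;

// Hypothetical demo class; NfcDemo and toNfc are illustrative names only.
class NfcDemo {
    // Thin wrapper around the Java SE 6 call used in the post.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by COMBINING ACUTE ACCENT (U+0301): a decomposed e-acute.
        String decomposed = "Jos\u0065\u0301";
        String composed = toNfc(decomposed);

        System.out.println(decomposed.length());           // 5
        System.out.println(composed.length());             // 4
        System.out.println(composed.equals("Jos\u00e9"));  // true

        // 'q' plus U+0301 has no precomposed form, so NFC leaves both code points.
        String noMapping = toNfc("q\u0301");
        System.out.println(noMapping.length());            // 2

        // The decomposed input is not already in NFC.
        System.out.println(
            Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
    }
}
```

The length comparison shows the compactness point: the composed string is one char shorter, while the q sequence stays as it was because no single code point exists for it.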