Normalization: Canonical Composition

Posted by joconner on February 11, 2007 at 3:45 PM PST

Continuing the discussion about Unicode normalization, I'll briefly describe Normalization Form C (NFC). NFC is canonical decomposition followed by canonical composition. It's the form you'll see most often on the web. In fact, NFC is the preferred normalization form for the World Wide Web. Why? Well, the form is usually slightly more compact than a decomposed form containing combining sequences, and most people simply feel more comfortable with a 1:1 correspondence between Unicode code points and the characters they see, rather than the many-to-one ratio of decomposed forms.

NFC tries to replace each combining sequence in a string with a single precomposed character. The mappings from combining sequences to precomposed forms are defined by the Unicode standard. Not every combining sequence has a precomposed form, so some sequences stay as they are, but the final normalized string will contain single precomposed code points wherever possible.

Java SE 6 provides the java.text.Normalizer class to help you put text into NFC:

import java.text.Normalizer;
// "Jos" followed by e (U+0065) and COMBINING ACUTE ACCENT (U+0301)
String anotherName = "Jos\u0065\u0301";
String nfcName = Normalizer.normalize(anotherName, Normalizer.Form.NFC);

Now nfcName is "Jos\u00E9": the e and the combining acute accent have been replaced by the single precomposed character \u00E9, é.
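
If you want to try this yourself, here is a minimal, self-contained sketch. The class name NfcSketch and the toNfc helper are my own illustrative names, not part of the JDK; only Normalizer.normalize and Normalizer.isNormalized come from java.text.Normalizer. It also shows the "wherever possible" caveat: as far as I know, Unicode has no precomposed q with an acute accent, so that sequence should come back unchanged.

```java
import java.text.Normalizer;

class NfcSketch {
    // Hypothetical helper (not a JDK method): put a string into NFC.
    public static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "Jos" + e (U+0065) + COMBINING ACUTE ACCENT (U+0301): five code units.
        String decomposed = "Jos\u0065\u0301";
        String composed = toNfc(decomposed);

        // The combining sequence collapses into one precomposed character.
        System.out.println(decomposed.length() + " -> " + composed.length());

        // isNormalized reports whether a string is already in the given form.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC));
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFC));

        // Assumption: q + U+0301 has no precomposed equivalent,
        // so NFC leaves the two-code-point sequence in place.
        String noPrecomposed = "q\u0301";
        System.out.println(toNfc(noPrecomposed).equals(noPrecomposed));
    }
}
```

Calling isNormalized first is a cheap way to skip strings that are already in NFC, which is likely to be the common case for text that came from the web.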

In the next couple of weeks, Sun will publish another Core Java Technology Tech Tip. The next one contains a great tip from Sergey Groznyh, a JDK programmer at Sun, who describes how to use the Normalizer class. If you haven't already, you might want to subscribe to the tech tips newsletter, available when you become a Sun Developer Network member.