Unicode 4.0 support in J2SE 1.5

Posted by joconner on April 16, 2004 at 1:08 AM PDT

The Java™ platform has always supported Unicode, but the newest changes for Unicode 4.0 deserve special comment. Unicode itself has evolved to support over a million different code points or basic characters. The code point range is now 0x0000 through 0x10FFFF.

Some major changes were required for J2SE 1.5 to provide support for all Unicode 4.0 code points. Since changes could potentially affect the Java language itself, the Java Community Process was used to determine how the platform should change. JSR 204 was created for that purpose.

In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:

  • char is a UTF-16 code unit, not a code point
  • new low-level APIs use an int to represent a Unicode code point
  • high-level APIs have been updated to understand surrogate pairs
  • a preference for char sequence APIs over char-based methods

Hey, that's a code unit, not a character! Java created the char type as a 16-bit entity. At one time a char represented a complete Unicode code point or character. Now, however, a single char clearly cannot represent the entire range of valid Unicode characters. char is now a UTF-16 code unit. Characters in the code point range U+0000 through U+FFFF are still represented by a single char code unit, but supplementary characters (those above U+FFFF) require two char values, known as a surrogate pair. Although I hesitate to say it, you probably won't notice anything until you localize your product for use in Japan, Korea, China, or Taiwan.
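To make the distinction concrete, here is a minimal sketch that compares the number of char code units in a string with the number of code points it holds. The class name CodeUnitDemo and its helper methods are made up for illustration; the calls they wrap (String.length, String.codePointCount, Character.isHighSurrogate) are standard J2SE 1.5 APIs.

```java
class CodeUnitDemo {
    // U+20000, a CJK ideograph outside the 16-bit range, written as its surrogate pair.
    static final String SUPPLEMENTARY = "\uD840\uDC00";

    // Number of UTF-16 code units (char values) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "A" + SUPPLEMENTARY;
        System.out.println(charCount(text));      // 3 code units
        System.out.println(codePointCount(text)); // 2 code points
        // The second char is only half of a character.
        System.out.println(Character.isHighSurrogate(text.charAt(1))); // true
    }
}
```

Any code that treats length() as "number of characters" is really counting code units, which only differs from the code point count when supplementary characters are present.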

New low-level APIs in Character and elsewhere use the int type to represent Unicode code points. Since 16 bits can't represent all possible character values, new APIs were added that use 32-bit int values instead. In reality only 21 of the 32 bits are needed. Now you have overloaded methods like Character.isLetter(int ch) and Character.isJavaIdentifierStart(int ch). In addition, new APIs like Character.toChars(int cp) and Character.toCodePoint(char high, char low) make conversions between char arrays and code points easier.
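As a rough illustration of these int-based methods, the sketch below round-trips one supplementary code point through Character.toChars and Character.toCodePoint and queries it with the int overload of isLetter. The class name CodePointApiDemo and the choice of U+10400 (a Deseret capital letter) are my own for the example.

```java
class CodePointApiDemo {
    // U+10400 DESERET CAPITAL LETTER LONG I, chosen only as a sample supplementary code point.
    static final int CP = 0x10400;

    // Split a code point into its UTF-16 code units (one or two char values).
    static char[] toUnits(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Combine a surrogate pair back into a single code point.
    static int combine(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char[] units = toUnits(CP);
        System.out.println(units.length);                    // 2
        System.out.println(Integer.toHexString(units[0]));   // d801
        System.out.println(Integer.toHexString(units[1]));   // dc00
        System.out.println(combine(units[0], units[1]) == CP); // true
        System.out.println(Character.isLetter(CP));          // true, via the int overload
        System.out.println(Character.charCount(CP));         // 2 chars needed for this code point
    }
}
```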

Under the hood, high-level APIs now work with surrogate pairs! And you won't have to learn any new API to get the benefit either. For example, String.toUpperCase() will now work with surrogate pairs that represent supplementary characters. Regular expressions, collation APIs, and all the text rendering in the 2D APIs can now properly process, sort, and display supplementary characters without any changes to your application. Of course, you'll need an appropriate font to see those new blocks of characters.
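Here is a small sketch of that behavior using nothing but existing String and regex calls. The class name CaseDemo is hypothetical, and the Deseret letters U+10428 and U+10400 are simply a convenient lowercase/uppercase pair among the supplementary characters. Locale.ENGLISH is passed explicitly only to keep the result independent of the machine's default locale.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class CaseDemo {
    // U+10428 DESERET SMALL LETTER LONG I, as a surrogate pair.
    static final String LOWER = "\uD801\uDC28";
    // U+10400 DESERET CAPITAL LETTER LONG I, as a surrogate pair.
    static final String UPPER = "\uD801\uDC00";

    // Plain String uppercasing; no code-point-specific API is needed.
    static String upper(String s) {
        return s.toUpperCase(Locale.ENGLISH);
    }

    // A single regex dot matches one whole code point, not one char.
    static boolean matchesSingleDot(String s) {
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        System.out.println(upper(LOWER).equals(UPPER)); // true
        System.out.println(matchesSingleDot(LOWER));    // true, although LOWER.length() is 2
    }
}
```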

No char is an island. The Java platform will begin to show a clear bias towards "char sequence" APIs instead of single char APIs. A char sequence is a char[], String, StringBuffer, or other structure that can hold two or more char values. Why the bias? Well, a char just isn't everything it used to be. Characters typically aren't standalone entities; they usually come in a group. A character can be composed of multiple char values, for example a base letter followed by combining marks (accents, tone marks, etc.) or a surrogate pair that represents one of the new supplementary characters. Operations like uppercasing have always had the potential to produce multiple char values. For example, uppercasing the German 'ß' doesn't generate just one char; it creates "SS", which is a string, not a single char at all. So char sequences are definitely the right way to process characters.
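The 'ß' case is easy to try for yourself. The sketch below (class name SharpSDemo is made up) contrasts the single char method, which has no room to return two letters, with the String method, and also shows an accented letter built from a base letter plus a combining mark.

```java
import java.util.Locale;

class SharpSDemo {
    // The char-based API can only return one char, so 'ß' comes back unchanged.
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    // The String-based API is free to return a longer result.
    static String upperString(String s) {
        return s.toUpperCase(Locale.GERMAN);
    }

    public static void main(String[] args) {
        System.out.println(upperChar('\u00DF') == '\u00DF');       // true: still 'ß'
        System.out.println(upperString("\u00DF"));                 // SS
        System.out.println(upperString("stra\u00DFe"));            // STRASSE

        // "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character, two chars.
        String eAcute = "e\u0301";
        System.out.println(eAcute.length());                        // 2
    }
}
```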

What does this mean to you? It depends. If your applications have primarily used char-based APIs, you may have considerable work to do to move to char sequence based methods. However, if your application mostly uses Strings, StringBuffers, or char[]s either as method arguments or as return values, you may not have much to do at all. Most developers will be somewhere in the middle, so it will probably require a little work from everyone to properly support all of the Unicode characters in your application. Regardless of where you are, you can be assured that the underlying support is there when you decide you need it.
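As one possible picture of that "little work", here is a sketch of a char-by-char loop next to a code point aware version of the same loop. The class LetterCounter and both method names are hypothetical; the pattern of stepping by Character.charCount(cp) after String.codePointAt(i) is just one way to iterate over code points with the 1.5 APIs.

```java
class LetterCounter {
    // Old style: examines one char at a time, so each half of a surrogate pair
    // is tested on its own and is never reported as a letter.
    static int countLettersByChar(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Updated style: reads a whole code point, then advances by one or two chars.
    static int countLettersByCodePoint(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', U+20000 (a supplementary CJK ideograph), 'b'
        String text = "a\uD840\uDC00b";
        System.out.println(countLettersByChar(text));      // 2: the ideograph is missed
        System.out.println(countLettersByCodePoint(text)); // 3
    }
}
```

For text that contains no supplementary characters the two loops give the same answer, which is why many applications will not notice a difference until such characters show up in their data.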
