|
|
||
John O'Conner's BlogUnicode 4.0 support in J2SE 1.5Posted by joconner on April 16, 2004 at 01:08 AM | Comments (1)The JavaTM platform has always supported Unicode, but the newest changes for Unicode 4.0 deserve special comment. Unicode itself has evolved to support over a million different code points or basic characters. The code point range is now 0x0000 through 0x10FFFF. Some major changes were required for J2SE 1.5 to provide support for all Unicode 4.0 code points. Since changes could potentially affect the Java language itself, the Java Community Process was used to determine how the platform should change. JSR 204 was created for that purpose. In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:
Hey, that's a code unit, not a character! Java created the
New low-level APIs in Under the hood, high level APIs now work with surrogate pairs! And you'll not have to learn any new API to get the benefit either. For example, String.toUpperCase() will now work with surrogate pairs that represent supplementary characters. Regular expressions, collation APIs, and all the text rendering in the 2D APIs can now property process, sort, and display supplementary characters without any changes to your application. Of course, you'll need an appropriate font to see those new blocks of characters. No char is an island. The Java platform will begin to show a clear bias towards "char sequence" APIs instead of single char API. A char sequence is a char[], String, StringBuffer, or other structure that can hold 2 or more char values. Why the bias? Well, a char just isn't everything it used to be. Characters typically aren't standalone entities...they usually come in a group. Characters can be composed of multiple char values include combing marks(accents, tone marks, etc) or surrogate pairs that represent the new supplementary characters. Operations like uppercasing have always had the potential to produce multiple char values. For example, uppercasing the German 'ß' doesn't generate just one char; it creates "SS", which is a string, not a single char at all. So char sequences are definitely the right way to process characters. What's this mean to you? Well...it depends. If your applications have primarily used char based APIs, you may have considerable work to update to char sequenced based methods. However, if your application mostly uses Strings, StringBuffers, or char[]s either as method arguments or as return values, you may not have much to do at all. Most developers will be somewhere in the middle, so it will probably require a litte work from everyone to properly support all of the Unicode characters in your application. Regardless of where you are, you can be assured that the underlying support is there when you decide you need it. For more information, please see the following:
Bookmark blog post: CommentsComments are listed in date ascending order (oldest first) | Post Comment
| ||
|
|