Unicode marches on

Posted by arnold on June 6, 2005 at 6:04 AM PDT
I'm working on the 4th edition of The Java Programming Language, and everyone of course has heard of the major new features. One of the odd little corners, though, is that Unicode has now grown beyond a 16-bit character standard, and so has lots of interesting new complications. Trivially, every method in the Character class that asks about a char now has an overload to which you can pass an int, and several other methods about strings and characters work with char arrays of length one or two to hold the larger characters. I'm sure that this will matter to somebody.
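
For the somebody to whom it matters, here is a minimal sketch of those int overloads and the one-or-two-element char arrays, assuming Java 5 or later. The class name CodePointDemo and its helper methods are invented for illustration; only the Character and String calls come from the platform.

```java
// A sketch only; CodePointDemo and its method names are invented for illustration.
class CodePointDemo {
    // U+10450 SHAVIAN LETTER PEEP sits above U+FFFF, so one char cannot hold it.
    static final int SHAVIAN_PEEP = 0x10450;

    // Uses the int overload of Character.isLetter rather than the char one.
    static boolean isLetter(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // Character.toChars returns a char array of length one or two.
    static int charsNeeded(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(SHAVIAN_PEEP));
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("letter? " + isLetter(s.codePointAt(0)));
    }
}
```

The point to notice is that String.length() counts chars, not characters, so a single Shavian letter reports a length of two.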

What I always look for, though, is the new version of the Unicode book. In about 1,000 pages of technical text, one can find out amazing things about various languages and writing systems. Our book already has an exposition on title case (which is a third case beyond lower and upper case, used in Croatian). Case again rears its head for Georgian, which has an upper case, but the upper case is considered archaic, so while toLowerCase will translate any upper case Georgian letter to its lower case equivalent, toUpperCase will not do the opposite.
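
Title case is easier to believe once you see it. The sketch below uses the Croatian digraph at U+01C4 through U+01C6, which has separate upper, title, and lower forms. The class name TitleCaseDemo and its constants are made up for this example, and it deliberately leaves Georgian alone, since that behavior may depend on which Unicode version your Java release implements.

```java
// A sketch only; TitleCaseDemo and its constant names are invented for illustration.
class TitleCaseDemo {
    static final char DZ_UPPER = '\u01C4'; // both letters capital
    static final char DZ_TITLE = '\u01C5'; // capital D, small z
    static final char DZ_LOWER = '\u01C6'; // both letters small

    public static void main(String[] args) {
        // Each conversion lands on a different character.
        System.out.println(Character.toUpperCase(DZ_LOWER) == DZ_UPPER);
        System.out.println(Character.toTitleCase(DZ_LOWER) == DZ_TITLE);
        System.out.println(Character.toLowerCase(DZ_TITLE) == DZ_LOWER);
        // Only the middle form is classified as title case.
        System.out.println(Character.isTitleCase(DZ_TITLE));
    }
}
```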

And so it goes. One of my favorites is \u09F8 (which looks sort of like a backward N). This is a Bengali currency symbol that means “one less than the denominator” in a fraction. I don't know what this is really for, but I've always assumed it was the way that Bengalis did the equivalent of making things look cheaper by pricing things for $19.99 instead of $20.00. So instead of writing “19 31/32” or “19 15/16”, they just made a mark that meant “Doesn't look as expensive as it really is!” Saved a lot of time and trouble, no doubt. Maybe it means something else, and I may be just about to find out from some Kind Reader, but I've thought sometimes that we ought to extend Java that way: Allow \u09F8 in expressions such as “\u09F8 / i” to mean “figure out what i is, subtract one, and then divide by i”. How useful!

Well, today I finally got my copy of the Unicode 4 book, and I've opened it right up to the choicest bit: \u2600 - \u26FF. That's where they put the “Miscellaneous Symbols”, and they are quite a miscellaneous lot. This is where you can find the hammer and sickle sign for communism alongside the peace sign. Want a snowman in your text? Try \u2603. Writing a horoscope? Here are your astrology symbols, right next to the religious and political symbols, such as the one for Iran (\u262B).
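
If you want to check whether a character lives in this neighborhood, Character.UnicodeBlock can tell you. This is a small sketch, assuming Java 5 or later for the int overload of UnicodeBlock.of; the class name SymbolBlockDemo and the helper isMiscSymbol are invented for illustration.

```java
// A sketch only; SymbolBlockDemo and isMiscSymbol are invented for illustration.
class SymbolBlockDemo {
    // True when the code point falls in the Miscellaneous Symbols block.
    static boolean isMiscSymbol(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS;
    }

    public static void main(String[] args) {
        System.out.println("snowman: " + isMiscSymbol(0x2603));
        System.out.println("hot springs: " + isMiscSymbol(0x2668));
        System.out.println("plain letter A: " + isMiscSymbol('A'));
    }
}
```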

I've always figured that some of this got here as pranks. Maybe in each version the leader of the group or the person who wins the karaoke contest gets to add one personal symbol. I don't know, but it might explain why, tucked in between the card suits and a set of basic Western musical notations, is a symbol for “hot springs” (\u2668).

Well, this version's karaoke star was a recycling nut, I guess. There are 12 symbols for recycling added. And maybe this year's drunken evening was at the casino because now we have images for all sides of a six-sided die. (Not D&D players, clearly, as we have no images of 20- or 12-sided die sides. Maybe next time.)

Of course, Unicode 4.0 is not just about adding a second kind of umbrella symbol (\u2614, which is just like the existing one except that it has little raindrops above it; for verisimilitude, I guess) or a steaming coffee cup (\u2615, so you can have coffee in your hot springs). That's just where groupies like me go to look at the seamy underbelly of character encodings.

No, it was mostly about adding entire new character sets, such as Mongolian, Osmanya (Somali), and Canadian Aboriginal. Many added sets were historic. Unicode is not only for modern languages, but also provides standard ways to encode ancient languages. So Linear B (an ancient Greek script) is available, as are Runic and Ogham (old Irish). They also added Deseret and Shavian, scripts designed for writing English phonetically (Shavian was designed by playwright G.B. Shaw. Mark Twain also invented such a system, but his dropped redundant letters (“c” can be written with “s” or “k”) and then reused them. This had the advantage of not requiring an entirely new typographic system, or a new Unicode block set, so he remains unrecognized by the Unicode committee.)

The really neat thing about this is that these are all characters. I mean, real actual characters to Java. You can write your identifiers in ancient Irish or Scandinavian Runes. Although it raises some interesting questions, such as: What would Homer name a loop index variable? Or, Should you use Shavian instead of ASCII for English variables to help the reader pronounce the names correctly? Or if you name identifiers using Ugaritic, which is a cuneiform system, do you need to be able to print out your code on clay tablets?
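
In case that sounds like a joke, here is a rough sketch that asks Java directly and then declares a Runic variable. It assumes Java 5 or later for the int overload of Character.isJavaIdentifierStart. The class name IdentifierDemo and its methods are made up for this example, and the rune is spelled as a Unicode escape so the source file stays plain ASCII.

```java
// A sketch only; IdentifierDemo and its method names are invented for illustration.
class IdentifierDemo {
    // Asks whether a code point may begin a Java identifier.
    static boolean canStartIdentifier(int codePoint) {
        return Character.isJavaIdentifierStart(codePoint);
    }

    // Declares and returns a local variable whose name is a single rune (U+16A0).
    static int runicLocal() {
        int \u16A0 = 42;
        return \u16A0;
    }

    public static void main(String[] args) {
        System.out.println("Runic:    " + canStartIdentifier(0x16A0));
        System.out.println("Ogham:    " + canStartIdentifier(0x1681));
        System.out.println("Ugaritic: " + canStartIdentifier(0x10380));
        System.out.println("Shavian:  " + canStartIdentifier(0x10450));
        System.out.println("Snowman:  " + canStartIdentifier(0x2603));
        System.out.println("value:    " + runicLocal());
    }
}
```

The snowman is there as a contrast: it is a symbol rather than a letter, so it should not be accepted as the start of an identifier.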

Unfortunately Java does not (yet) allow you to write numbers in non-ASCII scripts, but if we are fortunate, we could someday be able to write our constants using ancient Aegean numbers.
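
The restriction is on literals in source code. At run time the library is more generous: as far as I know, Character.digit and Integer.parseInt accept decimal digits from other scripts. Here is a hedged sketch using Arabic-Indic and Devanagari digits, which I picked only as examples; the class name UnicodeDigitDemo and its helpers are invented for illustration.

```java
// A sketch only; UnicodeDigitDemo and its method names are invented for illustration.
class UnicodeDigitDemo {
    // Numeric value of a decimal digit from any script, or -1 if it is not one.
    static int digitValue(int codePoint) {
        return Character.digit(codePoint, 10);
    }

    // Parses a string of decimal digits; the digits need not be ASCII.
    static int parse(String digits) {
        return Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        // Arabic-Indic digits one, two, three, written as escapes.
        System.out.println(parse("\u0661\u0662\u0663"));
        // Devanagari digit three.
        System.out.println(digitValue(0x0969));
    }
}
```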

Time to start lobbying!
