
Unicode marches on

Posted by arnold on June 6, 2005 at 6:04 AM PDT

I'm working on the 4th edition of The Java Programming Language,
and everyone of course has heard of the major new features. One of the
odd little corners, though, is that Unicode has now grown beyond a
16-bit character standard, and so has lots of interesting new
complications. Trivially, every method in the Character
class that asks about a char now has an overload to which
you can pass an int, and several other methods about strings
and characters take char arrays that can be one or two in
length to hold the larger characters. I'm sure that this will
matter to somebody.
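
For the somebody to whom it matters, here is a minimal sketch of those int overloads in use. The class and method names are made up for illustration, and the Ugaritic letter U+10380 is just a convenient character that lives beyond the 16-bit range; the `Character` and `String` calls themselves are the standard ones added in Java 5.

```java
final class SupplementaryDemo {
    // U+10380 UGARITIC LETTER ALPA lies outside the 16-bit range.
    static final int UGARITIC_ALPA = 0x10380;

    /** Returns how many char values (one or two) a code point occupies. */
    static int charsNeeded(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    /** Counts letters by code point, so a surrogate pair counts once. */
    static int countLetters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {   // the int overload
                count++;
            }
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(UGARITIC_ALPA));
        System.out.println(s.length());                      // 3 chars
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(countLetters(s));                 // 2 letters
    }
}
```

The point of the loop is that stepping one char at a time would land in the middle of a surrogate pair; stepping by `Character.charCount` avoids that.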

What I always look for, though, is the new version of the
Unicode book. In about 1,000 pages of technical text, one can find
out amazing things about various languages and writing systems.
Our book already has an exposition on title case (which is a third
case beyond lower and upper case, used in Croatian). Case again
rears its head for Georgian which has an upper case, but the upper
case is considered archaic, so while toLowerCase will
translate any upper case Georgian letter to its lower case equivalent,
toUpperCase will not do the opposite.
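
Title case is easy to poke at from code. The sketch below uses the DZ-with-caron digraph (U+01C4, U+01C5, U+01C6), which has distinct upper, title, and lower case forms; the class and helper names are my own invention. The round-trip helper can also be pointed at a Georgian capital, but I make no promise about what it prints there, since the answer depends on the Unicode data shipped with your JDK.

```java
final class TitleCaseDemo {
    // Upper, title, and lower case forms of the DZ-with-caron digraph.
    static final char UPPER = '\u01C4';
    static final char TITLE = '\u01C5';
    static final char LOWER = '\u01C6';

    /** Title case form of a code point, per the JDK's Unicode tables. */
    static int titleOf(int codePoint) {
        return Character.toTitleCase(codePoint);
    }

    /** True if lowering and then raising the case gets the original back. */
    static boolean survivesRoundTrip(int codePoint) {
        return Character.toUpperCase(Character.toLowerCase(codePoint)) == codePoint;
    }

    public static void main(String[] args) {
        System.out.println(titleOf(LOWER) == TITLE);        // true
        System.out.println(Character.isTitleCase(TITLE));   // true
        System.out.println(survivesRoundTrip(UPPER));       // true
        System.out.println(survivesRoundTrip(TITLE));       // false: comes back as UPPER
        // U+10A0 GEORGIAN CAPITAL LETTER AN; result varies with the JDK's Unicode version.
        System.out.println(survivesRoundTrip(0x10A0));
    }
}
```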

And so it goes. One of my favorites is \u09F8 (which looks
sort of like a backward N). This is a Bengali currency
symbol that means “one less than the denominator” in a fraction.
I don't know what this is really for, but I've always assumed it
was the way that Bengalis did the equivalent of making things look cheaper
by pricing things for $19.99 instead of $20.00. So instead of
writing “19 31/32” or “19 15/16”, they just made a mark that meant
“Doesn't look as expensive as it really is!” Saved a lot of time
and trouble, no doubt. Maybe it means something else, and I may
be just about to find out from some Kind Reader,
but I've thought sometimes that we ought to extend Java that way:
Allow \u09F8 in expressions such as “\u09F8 / i” to mean
“figure out what i is, subtract one, and then divide by
i”. How useful!

Well, today I finally got my copy of the Unicode 4 book, and I've
opened it right up to the choicest bit: \u2600 - \u26FF. That's where
they put the “Miscellaneous Symbols”, and they are quite a
miscellaneous lot. This is where you can find the hammer and
sickle sign for communism alongside the peace sign. Want a snowman
in your text? Try \u2603. Writing a horoscope? Here are
your astrology symbols, right next to the religious and political
symbols, such as the one for Iran (\u262B).
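
If you want to go hunting in that range yourself, a small sketch like the following will do. The class and method names are hypothetical; `Character.UnicodeBlock.of` and `Character.getType` are the standard calls. Whether the symbols actually show up on your console depends on your font and terminal encoding, which I can't vouch for.

```java
final class MiscSymbolsDemo {
    /** True if the code point falls in the Miscellaneous Symbols block. */
    static boolean isMiscSymbol(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS;
    }

    public static void main(String[] args) {
        // Snowman, hammer and sickle, peace sign, hot springs.
        int[] samples = {0x2603, 0x262D, 0x262E, 0x2668};
        for (int cp : samples) {
            String glyph = new String(Character.toChars(cp));
            boolean symbol = Character.getType(cp) == Character.OTHER_SYMBOL;
            System.out.println(Integer.toHexString(cp) + " " + glyph
                    + " miscSymbol=" + isMiscSymbol(cp)
                    + " otherSymbol=" + symbol);
        }
    }
}
```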

I've always figured that some of this got here as pranks.
Maybe in each version the leader of the group or the
person who wins the karaoke contest gets to add one personal symbol.
I don't know, but it might explain why, tucked in between the card suits
and a set of basic Western musical notations, is a symbol for “hot springs”.

Well, this version's karaoke star was a recycling nut, I guess.
There are 12 symbols for recycling added. And maybe this year's
drunken evening was at the casino because now we have images for
all sides of a six-sided die. (Not D&D players, clearly, as we
have no images of 20- or 12-sided die sides. Maybe next time.)

Of course, Unicode 4.0 is not just about adding a second kind of
umbrella symbol (\u2614, which is just like the existing one
except that it has little raindrops above it; for verisimilitude, I
guess) or a steaming coffee cup (\u2615, so you can have
coffee in your hot springs). That's just where groupies like me
go to look at the seamy underbelly of character encodings.

No, it was mostly about adding new entire character sets, such as
Mongolian, Osmanya (Somali), and Canadian Aboriginal. Many added
sets were historic. Unicode is not only for modern languages, but
also provides standard ways to encode ancient languages. So Linear B
(an ancient Greek script) is available, as are Runic and Ogham (old
Scandinavian and Irish scripts). They also added Deseret and Shavian,
scripts designed for
writing English phonetically (Shavian was designed by playwright
G.B. Shaw. Mark Twain also invented such a system, but his dropped
redundant letters (“c” can be written with “s” or “k”) and then
reused them. This had the advantage of not requiring an entirely
new typographic system, or a new Unicode block set, so he remains
unrecognized by the Unicode committee.)

The really neat thing about this is that these are all characters.
I mean, real actual characters to Java. You can write your identifiers
in ancient Irish or Scandinavian runes. Although it raises some
interesting questions, such as: What would Homer name a loop index
variable? Or, should you use Shavian instead of ASCII for English
variables to help the reader pronounce the names correctly? Or if
you name identifiers using Ugaritic, which is a cuneiform system, do you
need to be able to print out your code on clay tablets?
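
To show that I'm not making this up, here is a minimal sketch with a Runic loop index. The class and method names are hypothetical, and I've written the rune as a Unicode escape (\u16A0, RUNIC LETTER FEHU) so the source file survives any editor. The checker only tests the characters; it assumes you aren't handing it a keyword such as "for".

```java
final class RunicIdentifierDemo {
    /** True if every code point is legal in a Java identifier (keywords not checked). */
    static boolean isValidIdentifier(String name) {
        if (name.length() == 0) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Sums 1..n using a Runic letter as the loop index. */
    static int sumTo(int n) {
        int total = 0;
        for (int \u16A0 = 1; \u16A0 <= n; \u16A0++) {
            total += \u16A0;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(isValidIdentifier("\u16A0\u16B1")); // two Runic letters
        System.out.println(isValidIdentifier("\u1681"));       // an Ogham letter
        System.out.println(sumTo(4));                          // 10
    }
}
```

No clay tablets were required to compile it.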

Unfortunately Java does not (yet) allow you to write numbers in
non-ASCII scripts, but if we are fortunate, we could someday be
able to write our constants using ancient Aegean numbers.
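
In the meantime, the library is more open-minded than the compiler. As a small sketch (hypothetical class and method names, standard `Integer` and `Character` calls), `Integer.parseInt` will happily read decimal digits from another script at run time, and `Character.getNumericValue` can be asked about an Aegean numeral. I've left the Aegean result unannotated because it depends on the Unicode tables in your JDK.

```java
final class ForeignDigitsDemo {
    /** Parses a decimal string whose digits may come from any script. */
    static int parse(String digits) {
        return Integer.parseInt(digits);
    }

    /** Numeric value the JDK assigns to a code point, or a negative value if none. */
    static int numericValueOf(int codePoint) {
        return Character.getNumericValue(codePoint);
    }

    public static void main(String[] args) {
        // Devanagari digits one, two, three.
        System.out.println(parse("\u0967\u0968\u0969"));   // 123
        // U+10107 AEGEAN NUMBER ONE.
        System.out.println(numericValueOf(0x10107));
    }
}
```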

Time to start lobbying!
