Counting Characters
Posted by tomwhite on March 22, 2005 at 2:10 PM PST
During a panel discussion at the 1999 JavaOne conference Bill Joy, talking about the things he didn't like about Java, stated that he "didn't want
It was Unicode 3.1 that introduced supplementary characters: characters that require more than 16 bits to represent them. (These characters are rare - think obscure mathematical symbols and languages no longer used outside academia - like Linear B.) Since Java tracks the Unicode standard, the next release - 1.5 - had to solve the problem of how to represent these characters given that a Java char is only 16 bits wide.
For reasons of backward compatibility, the fix broke the one-to-one correspondence between Unicode characters and Java chars.
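To make the broken correspondence concrete, here is a minimal sketch (not part of the original post) that uses Character.toChars and Character.isSupplementaryCodePoint, both added in Java 1.5. The class name SurrogateDemo and the helper encode are made up for illustration; U+10400 is the supplementary character that appears in the test strings in this post.

```java
// Illustrative sketch: how a single supplementary code point is stored as two Java chars.
final class SurrogateDemo {

    /** Returns the UTF-16 chars that Java uses to store the given code point. */
    static char[] encode(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int deseret = 0x10400; // DESERET CAPITAL LETTER LONG I, outside the 16-bit range
        char[] units = encode(deseret);

        System.out.println(Character.isSupplementaryCodePoint(deseret)); // prints true
        System.out.println(units.length);                                // prints 2
        // Prints the surrogate pair: D801 DC00
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);
    }
}
```

A character in the 16-bit range, such as U+0041, comes back from the same helper as a single char.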
The following class counts the number of chars read from a Reader:
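import java.io.IOException;
import java.io.Reader;

public class CharCounter {
    // Counts Java chars (UTF-16 code units), not Unicode characters.
    public int count(Reader in) throws IOException {
        char[] buffer = new char[4096];
        int count = 0;
        int len;
        while ((len = in.read(buffer)) != -1) {
            count += len;
        }
        return count;
    }
}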
Running the following snippet of code exercises CharCounter:
String text = "\u0041\u00DF\u6771\uD801\uDC00";
CharCounter cc = new CharCounter();
System.out.println(cc.count(new StringReader(text)));
which correctly prints
5
Unicode Characters
The text string is a sequence of four Unicode characters (the same as the ones in the table in the above mentioned article, which also gives representative glyphs): U+0041, U+00DF, U+6771 and U+10400. The first three each fit in a single char, but the last is a supplementary character that needs two chars to represent it (\uD801\uDC00). This is why the four characters are represented in five chars.
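For a String that is already in memory, the gap between the two counts can be checked directly. The sketch below is an illustration rather than the post's own approach: it compares String.length() with String.codePointCount(), a method added in Java 1.5. The class name CodePointCountDemo and its helper are hypothetical.

```java
// Illustrative sketch: chars versus Unicode characters for an in-memory String.
final class CodePointCountDemo {

    /** Number of Unicode characters (code points) in the string. */
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println(text.length());         // prints 5 (Java chars)
        System.out.println(countCodePoints(text)); // prints 4 (Unicode characters)
    }
}
```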
To count Unicode characters we need to adjust the count using the Character.isHighSurrogate() method introduced in Java 1.5 to test whether a char is the first in a surrogate pair:
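import java.io.IOException;
import java.io.Reader;

public class UnicodeCharacterCounter {
    // Counts Unicode characters, assuming the input contains only well-formed surrogate pairs.
    public int count(Reader in) throws IOException {
        char[] buffer = new char[4096];
        int count = 0;
        int len;
        while ((len = in.read(buffer)) != -1) {
            count += len;
            for (int i = 0; i < len; i++) {
                // A surrogate pair is two chars but one character, so discount its first half.
                if (Character.isHighSurrogate(buffer[i])) {
                    count--;
                }
            }
        }
        return count;
    }
}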
The test code
String text = "\u0041\u00DF\u6771\uD801\uDC00";
UnicodeCharacterCounter ucc = new UnicodeCharacterCounter();
System.out.println(ucc.count(new StringReader(text)));
now prints the number of Unicode characters:
4
User Characters
For this application - counting characters - there is actually a Java library that can help out by hiding the low-level implementation details of supplementary characters:
import java.text.BreakIterator;

public class UserCharacterCounter {
    // Counts user-perceived characters by stepping over character boundaries.
    public int count(String text) {
        int count = 0;
        BreakIterator iter = BreakIterator.getCharacterInstance();
        iter.setText(text);
        while (iter.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }
}
The test code is
String text = "\u0041\u00DF\u6771\uD801\uDC00";
UserCharacterCounter ucc = new UserCharacterCounter();
System.out.println(ucc.count(text));
correctly giving the value 4.
The BreakIterator also treats a combining character sequence as a single user character. For example,
UserCharacterCounter ucc = new UserCharacterCounter();
System.out.println(ucc.count("g\u0301"));
prints 1 for LATIN SMALL LETTER G followed by COMBINING ACUTE ACCENT. The two Unicode characters combine into a single user character - "g acute".
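The three ways of counting can be compared side by side on one string. The following is a minimal sketch, not code from the original post: the class name ThreeCounts and its method names are invented for illustration, and it passes Locale.ROOT to BreakIterator.getCharacterInstance so that the result does not depend on the default locale.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Illustrative sketch: one string counted as chars, Unicode characters and user characters.
final class ThreeCounts {

    /** Java chars (UTF-16 code units). */
    static int chars(String s) {
        return s.length();
    }

    /** Unicode characters (code points). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** User characters, as delimited by BreakIterator's character boundaries. */
    static int userChars(String s) {
        BreakIterator iter = BreakIterator.getCharacterInstance(Locale.ROOT);
        iter.setText(s);
        int count = 0;
        while (iter.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "g acute" (two code points) followed by one supplementary character (two chars)
        String s = "g\u0301\uD801\uDC00";
        System.out.println(chars(s));      // prints 4
        System.out.println(codePoints(s)); // prints 3
        System.out.println(userChars(s));  // prints 2
    }
}
```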
Conclusion
Even as seemingly simple a concept as "character" has at least three interpretations (there are doubtless more). Happily, Java can cope with all three. (How do other languages fare?) However, I have a nagging question as Java approaches its tenth birthday: if Java were being invented today would a »