Counting Characters

Posted by tomwhite on March 22, 2005 at 2:10 PM PST

During a panel discussion at the 1999 JavaOne conference, Bill Joy, talking about the things he didn't like about Java, stated that he "didn't want char to be [a] numerical type". He was outvoted on this one, and as we know the Java char was blessed as a 16-bit numeric primitive type. I don't know whether Joy objected to it being in the numeric type hierarchy (so any char can be cast to an int, for example), or whether he didn't like it being a primitive, but what is certain is that the char's limited width has finally had an impact on the way support for Unicode 4.0 is implemented in Java 1.5.

It was Unicode 3.1 that introduced supplementary characters: characters that require more than 16 bits to represent them. (These characters are rare - think obscure mathematical symbols and languages no longer used outside academia - like Linear B.) Since Java tracks the Unicode standard, the next release - 1.5 - had to solve the problem of how to represent these characters given that a Java char is a 16-bit type. The details of how this was achieved are explained in the excellent article from Sun: Supplementary Characters in the Java Platform, which is required reading for all serious Java programmers. (You might want to read these two introductions first if you're not up on Unicode: Joel Spolsky and Tim Bray.)

For reasons of backward compatibility the fix broke the one-to-one correspondence between Unicode characters and Java chars. A char should now be interpreted as a UTF-16 code unit.
The breaking of this correspondence means applications that deal with individual characters may need to be changed - the above article has details. This got me thinking about how you count the number of characters in a piece of text.
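
To make the code unit idea concrete, here is a minimal sketch (the class name CodeUnitDemo and its helper methods are made up for illustration, and it assumes Java 1.5 or later). It builds a string holding the single supplementary character U+10400 (DESERET CAPITAL LETTER LONG I) and compares the number of chars with the number of Unicode code points:

```java
public class CodeUnitDemo {

  // Number of UTF-16 code units (Java chars) in the string.
  static int codeUnits(String s) {
    return s.length();
  }

  // Number of Unicode code points; a surrogate pair counts once.
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    // Character.toChars encodes one code point as one or two chars.
    String deseret = new String(Character.toChars(0x10400));

    System.out.println(codeUnits(deseret));   // prints 2
    System.out.println(codePoints(deseret));  // prints 1

    // The two chars are the high and low surrogates of the pair.
    System.out.println(Integer.toHexString(deseret.charAt(0))); // prints d801
    System.out.println(Integer.toHexString(deseret.charAt(1))); // prints dc00
  }
}
```

One character, two chars: that is the broken correspondence in miniature.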

Java chars

The following class counts the number of char primitives read from the passed-in Reader. In Java releases before 1.5, where every char was taken to be one Unicode character, this is an efficient way to count the number of Unicode characters.

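import java.io.IOException;
import java.io.Reader;

public class CharCounter {

  // Counts the chars (UTF-16 code units) read from the Reader.
  public int count(Reader in) throws IOException {
    char[] buffer = new char[4096];
    int count = 0;
    int len;
    while ((len = in.read(buffer)) != -1) {
      count += len;
    }
    return count;
  }

}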

Running the following snippet of code exercises CharCounter:

    String text = "\u0041\u00DF\u6771\uD801\uDC00";
    CharCounter cc = new CharCounter();
    System.out.println(cc.count(new StringReader(text)));

which correctly prints

5

Unicode Characters

The text string is a sequence of four Unicode characters (the same ones as in the table in the above-mentioned article, which also gives representative glyphs):

  1. U+0041 (LATIN CAPITAL LETTER A)
  2. U+00DF (LATIN SMALL LETTER SHARP S)
  3. U+6771 (a character from the CJK Unified Ideographs range)
  4. U+10400 (DESERET CAPITAL LETTER LONG I)

This last character is a supplementary character and needs two chars to represent it (\uD801\uDC00). This is why the four characters are represented in five chars.
To count Unicode characters we need to adjust the count using the Character.isHighSurrogate() method introduced in Java 1.5 to test whether a char is the first in a surrogate pair:

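import java.io.IOException;
import java.io.Reader;

public class UnicodeCharacterCounter {

  // Counts Unicode characters: each surrogate pair counts once.
  public int count(Reader in) throws IOException {
    char[] buffer = new char[4096];
    int count = 0;
    int len;
    while ((len = in.read(buffer)) != -1) {
      count += len;
      for (int i = 0; i < len; i++) {
        // Subtract one per high surrogate, so a pair adds only one.
        // This also works when a pair straddles two buffer reads.
        if (Character.isHighSurrogate(buffer[i])) {
          count--;
        }
      }
    }
    return count;
  }

}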

The test code

    String text = "\u0041\u00DF\u6771\uD801\uDC00";
    UnicodeCharacterCounter ucc = new UnicodeCharacterCounter();
    System.out.println(ucc.count(new StringReader(text)));

now prints the number of Unicode characters:

4
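
To see which four characters are being counted, the text can also be walked one code point at a time. The following is only a sketch (the class name CodePointLister is invented for this example); it uses String.codePointAt() and Character.charCount(), both added in Java 1.5, to step over a surrogate pair in one go:

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointLister {

  // Returns the code points of the text, each formatted as U+XXXX.
  static List<String> list(String text) {
    List<String> result = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      int codePoint = text.codePointAt(i);
      result.add(String.format("U+%04X", codePoint));
      // Advance by 2 chars for a supplementary character, else by 1.
      i += Character.charCount(codePoint);
    }
    return result;
  }

  public static void main(String[] args) {
    String text = "\u0041\u00DF\u6771\uD801\uDC00";
    System.out.println(list(text)); // prints [U+0041, U+00DF, U+6771, U+10400]
  }
}
```

The size of the returned list is the same value that UnicodeCharacterCounter prints for this text.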

User Characters

For this application - counting characters - there is actually a class in the Java library that can help out by hiding the low-level implementation details of supplementary characters: java.text.BreakIterator. (In other applications you may still need to test explicitly for supplementary characters, as in the previous example.) Here is an example for counting the number of characters in a string.

import java.text.BreakIterator;

public class UserCharacterCounter {
  // Counts the character boundaries reported by BreakIterator.
  public int count(String text) {
    int count = 0;
    BreakIterator iter = BreakIterator.getCharacterInstance();
    iter.setText(text);
    while (iter.next() != BreakIterator.DONE) {
      count++;
    }
    return count;
  }
}

The test code is

    String text = "\u0041\u00DF\u6771\uD801\uDC00";
    UserCharacterCounter ucc = new UserCharacterCounter();
    System.out.println(ucc.count(text));

correctly giving the value 4.

The BreakIterator character iterator generally conforms to what a user expects a character to be. In some circumstances this diverges from the Unicode definition of a character. For example,

    UserCharacterCounter ucc = new UserCharacterCounter();
    System.out.println(ucc.count("g\u0301"));

prints 1 for LATIN SMALL LETTER G followed by COMBINING ACUTE ACCENT. The two Unicode characters combine into a single user character - "g acute".
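
Putting the three interpretations side by side makes the difference easier to see. This is a rough sketch rather than a definitive utility (the class ThreeCounts and its method names are hypothetical); it reports the number of chars, Unicode characters and user characters for the same string:

```java
import java.text.BreakIterator;

public class ThreeCounts {

  // Java chars (UTF-16 code units).
  static int chars(String s) {
    return s.length();
  }

  // Unicode characters (code points).
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  // User characters, as delimited by BreakIterator.
  static int userChars(String s) {
    BreakIterator iter = BreakIterator.getCharacterInstance();
    iter.setText(s);
    int count = 0;
    while (iter.next() != BreakIterator.DONE) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    String gAcute = "g\u0301";
    System.out.println(chars(gAcute));      // prints 2
    System.out.println(codePoints(gAcute)); // prints 2
    System.out.println(userChars(gAcute));  // prints 1
  }
}
```

For the earlier test string the same three methods give 5, 4 and 4 respectively, matching the results of the three counter classes above.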

Conclusion

Even as seemingly simple a concept as "character" has at least three interpretations (there are doubtless more). Happily, Java can cope with all three. (How do other languages fare?) However, I have a nagging question as Java approaches its tenth birthday: if Java were being invented today, would a char be 32 bits?
