How long is your String?
When you ask a String for its length with the length method, the method will return the number of char code units in the String. That's OK, but it may not be telling you exactly what you wanted. There are several ways to determine the length of a String. Here are a few interpretations:
- number of chars in the string
- number of characters in the string
- number of bytes in the string
Calculating the number of chars is easy. You're familiar with String's length method, and it does the trick. It returns the number of char code units in a given String.
Although this was true before J2SE 5.0, the point has been made more explicit now: a char is not necessarily a complete character. Why? Supplementary characters exist in the Unicode character set. These are characters whose code points lie above the Basic Multilingual Plane, so they have values greater than 0xFFFF. They extend all the way up to 0x10FFFF. That's a lot of characters. In Java, these supplementary characters are represented as surrogate pairs, pairs of char units that fall in a specific range. The leading or high surrogate value is in the 0xD800 through 0xDBFF range. The trailing or low surrogate value is in the 0xDC00 through 0xDFFF range. What kinds of characters are supplementary? You can find out more from the Unicode site itself. My point is simply this: String's length method may or may not be what you want. You must understand the implications of supplementary characters now perhaps more than ever before.
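To make the surrogate ranges concrete, here is a minimal sketch that splits one supplementary code point into its char units. The class name SurrogateDemo and the helper toCodeUnits are hypothetical names chosen for illustration, and U+1D11E (MUSICAL SYMBOL G CLEF) is just one convenient sample; the Character methods used were all added in J2SE 5.0.

```java
public class SurrogateDemo {
    // Hypothetical helper: returns the UTF-16 code units for one code point.
    // The result has one char for BMP characters, two for supplementary ones.
    static char[] toCodeUnits(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int codePoint = 0x1D11E; // MUSICAL SYMBOL G CLEF, above 0xFFFF

        char[] units = toCodeUnits(codePoint);

        System.out.println("supplementary: "
                + Character.isSupplementaryCodePoint(codePoint));
        System.out.println("code units: " + units.length);

        // The first unit falls in 0xD800..0xDBFF, the second in 0xDC00..0xDFFF.
        System.out.printf("high surrogate: U+%04X (%b)%n",
                (int) units[0], Character.isHighSurrogate(units[0]));
        System.out.printf("low surrogate:  U+%04X (%b)%n",
                (int) units[1], Character.isLowSurrogate(units[1]));
    }
}
```

For this sample the two units are 0xD834 and 0xDD1E, so a String holding only this one character reports a length of 2.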
If length won't tell me how many characters are in a String, what will? Fortunately, the J2SE 5.0 API has a new String method: codePointCount(int beginIndex, int endIndex). This method will tell you how many Unicode code points are between the two indices. The index values refer to code unit or char locations, so endIndex - beginIndex for the entire String is equivalent to the String's length. Anyway, here's how you might use the method:
int charLen = myString.length();
int characterLen = myString.codePointCount(0, charLen);
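As a rough illustration of how the two counts diverge, the sketch below runs both methods on a sample string that contains one supplementary character. The class LengthDemo and the helper countCodePointsManually are hypothetical; the helper only exists to show what codePointCount is doing conceptually, by stepping over one or two char units per code point.

```java
public class LengthDemo {
    // Hypothetical helper: counts code points by walking the String and
    // advancing by one char for BMP characters, two for surrogate pairs.
    static int countCodePointsManually(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E written as a surrogate pair, then 'b'
        String myString = "a\uD834\uDD1Eb";

        int charLen = myString.length();
        int characterLen = myString.codePointCount(0, charLen);

        System.out.println("chars:       " + charLen);       // 4
        System.out.println("code points: " + characterLen);  // 3
        System.out.println("manual walk: " + countCodePointsManually(myString)); // 3
    }
}
```

For a String made up only of characters at or below U+FFFF, all three numbers are the same.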
OK, so how many bytes are in a String? This is the trickiest part of our trick question. The answer depends on which byte-oriented legacy charset you are trying to account for. One typical reason for asking "how many bytes?" is to make sure you're satisfying string length constraints in a database. The String class has a getBytes method that converts its Unicode characters into a legacy charset and returns them as a byte array. The length of that array is the number of bytes. You can find the supported charset names in the JDK notes.
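Here is one way the byte count might be measured, as a sketch rather than a recommendation. The class ByteLengthDemo and the helper byteLength are hypothetical names. The sketch uses the getBytes(Charset) overload, which arrived in Java 6; on J2SE 5.0 you would call getBytes(String charsetName) instead and handle UnsupportedEncodingException. The three charsets shown are ones every Java platform is required to support.

```java
import java.nio.charset.Charset;

public class ByteLengthDemo {
    // Hypothetical helper: encodes the String in the named charset and
    // reports how many bytes the encoded form occupies.
    static int byteLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String myString = "caf\u00E9"; // "cafe" with an acute accent on the e

        System.out.println("chars:      " + myString.length());                  // 4
        System.out.println("ISO-8859-1: " + byteLength(myString, "ISO-8859-1")); // 4
        System.out.println("UTF-8:      " + byteLength(myString, "UTF-8"));      // 5
        System.out.println("UTF-16BE:   " + byteLength(myString, "UTF-16BE"));   // 8
    }
}
```

The same four-char String needs 4, 5, or 8 bytes depending on the target charset, which is why a database column limit stated in bytes has to be checked against the charset the database actually uses.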
Unless you're using supplementary characters, you may never see a difference between the return values of length and codePointCount. However, as soon as you drift above U+FFFF, you'll be glad to know about the different ways to determine length. If you send your products to China or Japan, you're almost certain to find a situation in which length and codePointCount return different values. Is that important? Maybe, maybe not. However, at least now you'll know why they report different lengths.