The Source for Java Technology Collaboration



John O'Conner

John O'Conner's Blog

How long is your String?

Posted by joconner on August 21, 2005 at 11:04 PM | Comments (6)

When you ask a String for its length with

  myString.length()

the method will return the number of char code units in the String. That's OK, but it may not tell you exactly what you wanted. There are several ways to interpret the length of a String. Here are a few:
  • number of chars in the string
  • number of characters in the string
  • number of bytes in the string

Calculating the number of chars is easy. You're familiar with String's length method, and it does the trick. length returns the number of char code units in a given String.

Although this was true before J2SE 5.0, the point is more explicit now: a char is not necessarily a complete character. Why? Supplementary characters exist in the Unicode character set. These characters have code points above the Basic Multilingual Plane (BMP), with values greater than 0xFFFF, and they extend all the way up to 0x10FFFF. That's a lot of characters. In Java, supplementary characters are represented as surrogate pairs: pairs of char units that fall in specific ranges. The leading or high surrogate value is in the 0xD800 through 0xDBFF range. The trailing or low surrogate value is in the 0xDC00 through 0xDFFF range. What kinds of characters are supplementary? You can find out more from the Unicode site itself. My point is simply this: String's length method may or may not be what you want. You must understand the implications of supplementary characters now perhaps more than ever before.
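To make the surrogate ranges concrete, here is a minimal sketch that splits one supplementary character into its two char units. The class name SurrogateDemo, the helper toUtf16, and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) are assumptions made for illustration; the Character methods it calls were added in J2SE 5.0.

```java
class SurrogateDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF: a supplementary character picked for illustration.
    static final int G_CLEF = 0x1D11E;

    /** Returns the UTF-16 code units (chars) that represent the given code point. */
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(G_CLEF);
        System.out.println("char units: " + units.length); // prints "char units: 2"

        for (char c : units) {
            // Report each unit and which surrogate range it falls in.
            System.out.printf("0x%04X high=%b low=%b%n",
                    (int) c, Character.isHighSurrogate(c), Character.isLowSurrogate(c));
        }

        // Recombine the pair into the original code point.
        int restored = Character.toCodePoint(units[0], units[1]);
        System.out.printf("restored: 0x%X%n", restored); // prints "restored: 0x1D11E"
    }
}
```

The first unit (0xD834) lands in the high surrogate range and the second (0xDD1E) in the low surrogate range, which is why a single character here costs two chars.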

So, if length won't tell me how many characters are in a String, what will? Fortunately, the J2SE 5.0 API has a new String method: codePointCount(int beginIndex, int endIndex). This method tells you how many Unicode code points lie between the two indices. The index values refer to code unit (char) positions, so endIndex - beginIndex for the entire String is equal to the String's length. Anyway, here's how you might use the method:

  int charLen = myString.length();
  int characterLen = myString.codePointCount(0, charLen);
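As a runnable illustration of the two calls above, the following sketch compares both counts for a plain string and for one that contains a supplementary character. The class name StringLengths, its helper methods, and the sample strings are hypothetical and chosen only for this example.

```java
class StringLengths {
    /** Number of UTF-16 code units (chars) in the string. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmpOnly = "clef";
        // "clef " followed by U+1D11E, written as its surrogate pair.
        String withSupplementary = "clef \uD834\uDD1E";

        System.out.println(charCount(bmpOnly));                // prints 4
        System.out.println(codePointCount(bmpOnly));           // prints 4
        System.out.println(charCount(withSupplementary));      // prints 7
        System.out.println(codePointCount(withSupplementary)); // prints 6
    }
}
```

The counts agree for the first string and differ by one for the second, because the surrogate pair contributes two chars but only one code point.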

OK, so how many bytes are in a String? This is the trickiest part of our trick question. The answer depends on which byte-oriented legacy charset you are trying to account for. One typical reason for asking "how many bytes?" is to make sure you satisfy string length constraints in a database. The String class has a getBytes method that converts its Unicode characters into a legacy charset and returns them as a byte[]. You can find the names of the supported charsets in the JDK notes.
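Here is a rough sketch of how the byte count can vary with the target charset. The class name ByteLengths, the helper byteCount, and the sample string are assumptions for illustration. The sketch also assumes Java 7 or later, because it uses java.nio.charset.StandardCharsets; on J2SE 5.0 you would call getBytes(String charsetName) and handle UnsupportedEncodingException instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengths {
    /** Number of bytes the string occupies once encoded in the given charset. */
    static int byteCount(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String cafe = "caf\u00E9"; // "café": 4 chars, 4 code points

        System.out.println(byteCount(cafe, StandardCharsets.ISO_8859_1)); // prints 4
        System.out.println(byteCount(cafe, StandardCharsets.UTF_8));      // prints 5
        System.out.println(byteCount(cafe, StandardCharsets.UTF_16BE));   // prints 8

        // A supplementary character: 2 chars, 1 code point, 4 bytes in UTF-8.
        String gClef = "\uD834\uDD1E";
        System.out.println(byteCount(gClef, StandardCharsets.UTF_8));     // prints 4
    }
}
```

The same four-character string needs four, five, or eight bytes depending on the charset, so a database column limit expressed in bytes has to be checked against the charset the database actually stores. Note that getBytes(Charset) silently substitutes a replacement byte sequence for characters the charset cannot represent, so a byte count alone will not reveal data loss.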

Unless you're using supplementary characters, you won't see a difference between the return values of length and codePointCount. However, as soon as you drift above U+FFFF, you'll be glad to know about the different ways to determine length. If you send your products to China or Japan, you're almost certain to find a situation in which length and codePointCount return different values. Is that important? Maybe...maybe not. However, at least now you'll know why they report different lengths.


Comments

  • Good discussion! But last time I checked, "UTF-8" wasn't a legacy charset!

    Posted by: jessewilson on August 22, 2005 at 12:11 AM

  • Nice blog John. Ken Arnold has an interesting post about Unicode 4, and I covered similar ground in my blog entry about counting characters earlier this year.


    Tom

    Posted by: tomwhite on August 22, 2005 at 01:09 AM

  • Nice blog but actually, you're mistaken about one small thing.

    If you send your products to China or Japan, you're almost certain to find a situation in which length and codePointCount return different values.

    I don't know about Chinese, but Japanese doesn't use any characters that require surrogate pairs.

    Posted by: golly on August 22, 2005 at 02:33 AM

  • You're right. UTF-8 isn't a legacy charset, and it can be generated by the getBytes method...my oversight as a result of blogging too late on a Sunday night!
    As for Japanese, yes, in fact there are some personal names that must be encoded in the supplementary character area. Additionally, some Japanese software vendors plan on utilizing some of the space for character variants. No joke...Japanese is up there, really. And it's not just for Chinese and Japanese either. Numerous new scripts are there. Check out the Unicode site for more information.
    You can find out a bit more about this new supplementary character support by reading one of my coworker's articles, "Supplementary Characters in the Java Platform".

    Posted by: joconner on August 22, 2005 at 09:03 AM

  • I'm wondering: do you feel it was inevitable to break the 16-bit limit? Or were people a little too generous in the first years? Do we need "ancient greek musical symbols", "box drawing symbols" or "supplemental arrows A and B"?

    Posted by: mernst on August 23, 2005 at 01:50 AM

  • Besides bytes, code points, and characters, you may also want to care about visible glyphs (combining characters and double-width characters), which is important on a console, and you might also care about the rendered size in a given font.

    Most systems care about the bytes or about the visible glyph number. Some systems care about the code points.

    Greetings
    Bernd

    PS: I have a German blog article on Java 5 UTF-16 news:
    http://itblog.eckenfels.net/archives/17-Java-und-Unicode.html

    Posted by: ecki on August 23, 2005 at 05:52 AM




