


Ethan Nicholas

Ethan Nicholas's Blog

All about intern()

Posted by enicholas on June 26, 2006 at 02:16 PM | Comments (13)

Strings are a fundamental part of any modern programming language, every bit as important as numbers. So you'd think that Java programmers would go out of their way to have a solid understanding of them -- but sadly, that isn't always the case.

I was going through the source code to Xerces (the XML parser included in Java) today, when I found a very surprising line:

com.sun.org.apache.xerces.internal.impl.XMLScanner:395
protected final static String fVersionSymbol = "version".intern();

There are a number of strings defined like this, and every one of them is being interned. So what exactly is intern()? Well, as you no doubt know, there are two different ways to compare objects in Java. You can use the == operator, or you can use the equals() method. The == operator compares whether two references point to the same object, whereas the equals() method compares whether two objects contain the same data.

One of the first lessons you learn in Java is that you should usually use equals(), not ==, to compare two strings. If you compare, say, new String("Hello") == new String("Hello"), you will in fact receive false, because they are two different string instances. If you use equals() instead, you will receive true, just as you'd expect. Unfortunately, the equals() method can be fairly slow, as it involves a character-by-character comparison of the strings.
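
To make the difference concrete, here is a minimal sketch. The class and method names are made up for illustration; only == and String.equals() come from the text above.

```java
// Hypothetical demo class: contrasts reference identity with content equality.
class IdentityVsEquality {
    // True only if both references point to the very same object.
    static boolean sameInstance(String a, String b) {
        return a == b;
    }

    // True if both strings contain the same characters.
    static boolean sameContents(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // new String(...) always allocates a fresh object.
        String a = new String("Hello");
        String b = new String("Hello");

        System.out.println(sameInstance(a, b)); // false: two distinct instances
        System.out.println(sameContents(a, b)); // true: same characters
    }
}
```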

Since the == operator compares identity, all it has to do is compare two pointers to see if they are the same, so it will obviously be much faster than equals(). So if you're going to be comparing the same strings repeatedly, you can get a significant performance advantage by reducing the check to an identity comparison rather than an equality comparison. To do that, you need to make sure that equal strings are always represented by the same instance. The basic algorithm is:

1) Create a hash set of Strings
2) Check to see if the String you're dealing with is already in the set
3) If so, return the one from the set
4) Otherwise, add this string to the set and return it

After following this algorithm, you are guaranteed that if two strings contain the same characters, they are also the same instance. This means that you can safely compare strings using == rather than equals(), gaining a significant performance advantage with repeated comparisons.
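
The steps above could be sketched roughly as follows. This is only an illustration, not how the JVM does it: StringPool and canonicalize() are hypothetical names, and a HashMap stands in for the hash set because java.util.HashSet has no direct way to hand back the instance it already holds. The sketch is also not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical hand-rolled string pool, for illustration only.
class StringPool {
    // Maps each distinct string to the first instance seen with those contents.
    private final Map<String, String> pool = new HashMap<>();

    // Returns the pooled instance with the same characters as s,
    // adding s itself to the pool if no such instance exists yet.
    String canonicalize(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        StringPool pool = new StringPool();
        String a = pool.canonicalize(new String("Hello"));
        String b = pool.canonicalize(new String("Hello"));

        System.out.println(a == b); // true: both are the first pooled instance
    }
}
```

Once every string has passed through the same pool, a == check is enough to tell whether two of them hold the same characters.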

Fortunately, Java already includes an implementation of the algorithm above. It's the intern() method on java.lang.String. new String("Hello").intern() == new String("Hello").intern() returns true, whereas without the intern() calls it returns false.
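
A small sketch of that claim, using only String.intern() from the standard library; the class and helper names are invented for the example.

```java
// Hypothetical demo class: compares identity before and after intern().
class InternDemo {
    // Identity comparison of the interned versions of two strings.
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("Hello");
        String b = new String("Hello");

        System.out.println(a == b);                // false: separate instances
        System.out.println(sameAfterIntern(a, b)); // true: one shared pooled instance
    }
}
```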

So why was I so surprised to see protected final static String fVersionSymbol = "version".intern(); in the Xerces source code? Obviously this string will be used for many comparisons; doesn't it make sense to intern it?

Sure it does. That's why Java already does it. All constant strings that appear in a class are automatically interned. This includes both your own constants (like the above "version" string) as well as other strings that are part of the class file format -- class names, method and field signatures, and so forth. It even extends to constant string expressions: "Hel" + "lo" is processed by javac exactly the same as "Hello", and "Hel" + "lo" == "Hello" will return true.

So the result of calling intern() on a constant string like "version" is by definition going to be the exact same string you passed in. "version" == "version".intern(), always. You only need to intern strings when they are not constants, and you want to be able to quickly compare them to other interned strings.
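
The behaviour of constants can be checked with a short sketch like the one below. The class and method names are hypothetical. The contrast with a concatenation performed at run time is added here for illustration: it assumes the prefix arrives through a method parameter, so the expression is not a compile-time constant.

```java
// Hypothetical demo class: constant strings versus strings built at run time.
class ConstantInterning {
    // "Hel" + "lo" is a constant expression, folded by javac into "Hello".
    static boolean foldedConstantIsSameInstance() {
        return ("Hel" + "lo") == "Hello";
    }

    // prefix is a parameter, so this concatenation happens at run time
    // and yields a new String that is not the pooled instance.
    static boolean runtimeConcatIsSameInstance(String prefix) {
        return (prefix + "lo") == "Hello";
    }

    // Interning the run-time result maps it back to the pooled constant.
    static boolean internedConcatIsSameInstance(String prefix) {
        return (prefix + "lo").intern() == "Hello";
    }

    public static void main(String[] args) {
        System.out.println("version" == "version".intern());     // true
        System.out.println(foldedConstantIsSameInstance());      // true
        System.out.println(runtimeConcatIsSameInstance("Hel"));  // false
        System.out.println(internedConcatIsSameInstance("Hel")); // true
    }
}
```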

There can also be a memory advantage to interning strings -- you only keep one copy of the string's characters in memory, no matter how many times you refer to it. That's the main reason why class file constant strings are interned: think about how many classes refer to (say) java.lang.Object. The name of the class java.lang.Object has to appear in every single one of those classes, but thanks to the magic of intern(), it only appears in memory once.

The bottom line? intern() is a useful method and can make life easier -- but make sure that you're using it responsibly.


Comments
Comments are listed in date ascending order (oldest first)


  • This is a nice entry.


    I forget when interned Strings can become garbage! -- Ever?

    Posted by: steevcoco on June 26, 2006 at 06:39 PM

  • String interning can become a performance problem if used frequently in different threads, as intern() enters a critical section in the VM.

    Also, IIRC, interned strings get added to the permanent generation so you may have issues with that (out of memory conditions).

    This is probably what you meant in your last line with "using it responsibly."

    Posted by: goron on June 27, 2006 at 12:12 AM

  • Maybe unrelated, but for static final Strings declared like that, the compiler doesn't substitute the value of the String into client classes.

    So if you decided to sub-class XMLScanner in your app, when you reference fVersionSymbol, you'd get a real reference and not the value.

    Later on, if the Xerces guys decide to change the value of the variable, they'll see the new value without a recompile. If they hadn't used this, your client class would still have the old value "version" in it.

    Wasn't this one of the reasons why Sun added Enums to JDK 5?

    Posted by: classnotfound on June 27, 2006 at 03:23 AM

  • Err sorry, typo city: they'll = you'll.

    Posted by: classnotfound on June 27, 2006 at 03:28 AM

  • Hi, Beware! intern() is dangerous, because at least on older Sun JDKs its performance degrades quickly with the number of Strings you intern. Using a simple HashMap is typically quicker. intern() also puts the strings into permspace, which means that only a full GC will reclaim them. Regards, Markus (https://www.sdn.sap.com/irj/sdn/weblogs?blog=/pub/u/6389)

    Posted by: kohlerm on June 27, 2006 at 07:44 AM

  • This website has some good information about intern() in regards to garbage collection: http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html

    Essentially, the intern() contract is to ensure that canonical comparisons will return true/false. There is no implementation requirement for the "pool" of strings to maintain every value that is ever stored in it.

    I tried to look up the implementation details in the Java source code, but the intern() method was declared as 'native' and I didn't feel like searching through JRE DLLs.

    Posted by: haydensteep on June 27, 2006 at 07:52 AM

  • "Unfortunately, the equals() method can be fairly slow, as it involves a character-by-character comparison of the strings."


    Since the String equals() method first compares the references and returns true if it is the same object, isn't the performance effect of using equals() probably not as bad as it first seems?

    Posted by: sonnygill on June 27, 2006 at 07:24 PM

  • @goron, yes but not in the case of Xerces because
    string constants are already "interned" so
    the call to .intern() is a no-op.

    As classnotfound says, the idea here is only to call
    a method in order to avoid constant folding
    by the compiler.

    Rémi Forax

    Posted by: forax on June 28, 2006 at 12:27 AM


  • ========= SOURCE ============
    // ss1 and ss2 are the two Strings under test; their definitions
    // were not included in this comment.
    boolean b;
    String s1 = ss1.intern();
    String s2 = ss2.intern();

    // 1) Call intern() on every iteration, then compare by identity.
    long start = new Date().getTime();
    for (int i = 0; i < 999999; i++) {
        b = s1 == ss2.intern();
    }
    long end = new Date().getTime();
    System.out.println("1) intern each iteration " + (end - start));

    // 2) Intern once up front, then compare by identity only.
    start = new Date().getTime();
    for (int i = 0; i < 999999; i++) {
        b = s1 == s2;
    }
    end = new Date().getTime();
    System.out.println("2) one intern " + (end - start));

    // 3) Compare the original strings with equals().
    start = new Date().getTime();
    for (int i = 0; i < 999999; i++) {
        b = ss1.equals(ss2);
    }
    end = new Date().getTime();
    System.out.println("3) equals " + (end - start));
    ========= SOURCE ============

    Results:
    java.vm.name=BEA WebLogic JRockit(TM) 1.4.2_05 JVM R24.4.0-1
    1) intern each iteration 7468
    2) one intern 0
    3) equals 7422


    java.vm.version=1.4.2_10-b03
    java.vm.name=Java HotSpot(TM) Client VM
    1) intern each iteration 62311
    2) one intern 0
    3) equals 13000

    Posted by: zzulus on June 28, 2006 at 02:30 AM

  • Although the equals method is the obvious choice when you need to compare the actual character by character values of String objects, you should be aware that this method does not provide the right answer for all needs. I agree completely with your assessment of the intern method and the difference between equality as measured by the == operator vs the equals method. You can read my complete follow-up thoughts in my own java.net blog entry entitled "String's equals method isn't always enough".

    Posted by: joconner on June 28, 2006 at 12:06 PM

  • I would normally interpret
    final static String foo = "foo".intern();
    as an attempt to make the qualified
    name WhateverClass.foo not be a
    compile-time constant expression.
    (see 15.28 of the JLS)

    It can be useful to do this, and this
    technique is arguably less ugly than
    final static String foo = new String("foo");

    Posted by: tuc on June 28, 2006 at 12:28 PM

  • Regarding concerns about the performance of equals(): Check the source code[1]:
    src/share/classes/java/lang/String.java

    String.equals() does everything it can to avoid the char-by-char comparison.
    It is only when you have two Strings of the same length that equals() even
    starts looking at the chars, and it will bail out on the first mismatch.


    [1] download Mustang (Java SE 6)

    Posted by: timbell on June 28, 2006 at 12:39 PM

  • You leave one thing out: how is this string being used? Without that, it is impossible to judge whether interning is a good idea here. I suspect it isn't.

    Posted by: mernst on July 03, 2006 at 09:24 AM




