|
|
||
Ethan Nicholas's BlogPerformance ArchivesAll about intern()Posted by enicholas on June 26, 2006 at 02:16 PM | Permalink | Comments (13)Strings are a fundamental part of any modern programming language, every bit as important as numbers. So you'd think that Java programmers would go out of their way to have a solid understanding of them -- and sadly, that isn't always the case. I was going through the source code to Xerces (the XML parser included in Java) today, when I found a very surprising line: com.sun.org.apache.xerces.internal.impl.XMLScanner:395 There are a number of strings defined like this, and every one of them is being interned. So what exactly is intern()? Well, as you no doubt know, there are two different ways to compare objects in Java. You can use the == operator, or you can use the equals() method. The == operator compares whether two references point to the same object, whereas the equals() method compares whether two objects contain the same data. One of the first lessons you learn in Java is that you should usually use equals(), not ==, to compare two strings. If you compare, say, new String("Hello") == new String("Hello"), you will in fact receive false, because they are two different string instances. If you use equals() instead, you will receive true, just as you'd expect. Unfortunately, the equals() method can be fairly slow, as it involves a character-by-character comparison of the strings. Since the == method compares identity, all it has to do is compare two pointers to see if they are the same, and obviously it will be much faster than equals(). So if you're going to be comparing the same strings repeatedly, you can get a significant performance advantage by reducing it to an identity comparison rather than an equality comparison. The basic algorithm is: 1) Create a hash set of Strings After following this algorithm, you are guaranteed that if two strings contain the same characters, they are also the same instance. This means that you can safely compare strings using == rather than equals(), gaining a significant performance advantage with repeated comparisons. Fortunately, Java already includes an implementation of the algorithm above. It's the intern() method on java.lang.String. new String("Hello").intern() == new String("Hello").intern() returns true, whereas without the intern() calls it returns false. So why was I so surprised to see protected final static String fVersionSymbol = "version".intern(); in the Xerces source code? Obviously this string will be used for many comparisons, doesn't it make sense to intern it? Sure it does. That's why Java already does it. All constant strings that appear in a class are automatically interned. This includes both your own constants (like the above "version" string) as well as other strings that are part of the class file format -- class names, method and field signatures, and so forth. It even extends to constant string expressions: "Hel" + "lo" is processed by javac exactly the same as "Hello", and "Hel" + "lo" == "Hello" will return true. So the result of calling intern() on a constant string like "version" is by definition going to be the exact same string you passed in. "version" == "version".intern(), always. You only need to intern strings when they are not constants, and you want to be able to quickly compare them to other interned strings. There can also be a memory advantage to interning strings -- you only keep one copy of the string's characters in memory, no matter how many times you refer to it. That's the main reason why class file constant strings are interned: think about how many classes refer to (say) java.lang.Object. The name of the class java.lang.Object has to appear in every single one of those classes, but thanks to the magic of intern(), it only appears in memory once. The bottom line? intern() is a useful method and can make life easier -- but make sure that you're using it responsibly. Understanding Weak ReferencesPosted by enicholas on May 04, 2006 at 05:06 PM | Permalink | Comments (19)Some time ago I was interviewing candidates for a Senior Java Engineer position. Among the many questions I asked was "What can you tell me about weak references?" I wasn't expecting a detailed technical treatise on the subject. I would probably have been satisfied with "Umm... don't they have something to do with garbage collection?" I was instead surprised to find that out of twenty-odd engineers, all of whom had at least five years of Java experience and good qualifications, only two of them even knew that weak references existed, and only one of those two had actual useful knowledge about them. I even explained a bit about them, to see if I got an "Oh yeah" from anybody -- nope. I'm not sure why this knowledge is (evidently) uncommon, as weak references are a massively useful feature which have been around since Java 1.2 was released, over seven years ago. Now, I'm not suggesting you need to be a weak reference expert to qualify as a decent Java engineer. But I humbly submit that you should at least know what they are -- otherwise how will you know when you should be using them? Since they seem to be a little-known feature, here is a brief overview of what weak references are, how to use them, and when to use them. Strong references First I need to start with a refresher on strong references. A strong reference is an ordinary Java reference, the kind you use every day. For example, the code:
creates a new When strong references are too strong It's not uncommon for an application to use classes that it can't reasonably extend. The class might simply be marked What happens when you need to keep track of extra information about the object? In this case, suppose we find ourselves needing to keep track of each
This might look okay on the surface, but the strong reference to Another common problem with strong references is caching, particular with very large structures like images. Suppose you have an application which has to work with user-supplied images, like the web site design tool I work on. Naturally you want to cache these images, because loading them from disk is very expensive and you want to avoid the possibility of having two copies of the (potentially gigantic) image in memory at once. Because an image cache is supposed to prevent us from reloading images when we don't absolutely need to, you will quickly realize that the cache should always contain a reference to any image which is already in memory. With ordinary strong references, though, that reference itself will force the image to remain in memory, which requires you (just as above) to somehow determine when the image is no longer needed in memory and remove it from the cache, so that it becomes eligible for garbage collection. Once again you are forced to duplicate the behavior of the garbage collector and manually determine whether or not an object should be in memory. Weak references A weak reference, simply put, is a reference that isn't strong enough to force an object to remain in memory. Weak references allow you to leverage the garbage collector's ability to determine reachability for you, so you don't have to do it yourself. You create a weak reference like this:
and then elsewhere in the code you can use To solve the "widget serial number" problem above, the easiest thing to do is use the built-in Reference queues Once a The Different degrees of weakness Up to this point I've just been referring to "weak references", but there are actually four different degrees of reference strength: strong, soft, weak, and phantom, in order from strongest to weakest. We've already discussed strong and weak references, so let's take a look at the other two. Soft references A soft reference is exactly like a weak reference, except that it is less eager to throw away the object to which it refers. An object which is only weakly reachable (the strongest references to it are
Phantom references A phantom reference is quite different than either The difference is in exactly when the enqueuing happens. What good are Second, With Arguably, the Conclusion I'm sure some of you are grumbling by now, as I'm talking about an API which is nearly a decade old and haven't said anything which hasn't been said before. While that's certainly true, in my experience many Java programmers really don't know very much (if anything) about weak references, and I felt that a refresher course was needed. Hopefully you at least learned a little something from this review. | ||
|
|