
Short test on .hashCode() Performance

Posted by mkarg on January 3, 2010 at 6:29 AM PST

Several APIs require the user to implement the .hashCode() method. The reason is that these APIs use hash-based containers (like HashMap) as a fast means of managing large numbers of objects (always comparing objects using .equals() would take endless time). There are lots of standard implementations on the web, so the question is what performance impact the implementation of .hashCode() actually has. I ran some tests, and here are the results.

My test program uses a simple, but non-trivial, "PrimaryKey" class. This class has two integers (a and b). In a loop, I created one million of those "PrimaryKeys" and .put(PK, REF) them into a presized HashMap. Then I tried to find them again using .get(PK). Both loops access each of the one million objects exactly once, and the times for .put() and .get() are measured separately.
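A minimal sketch of that setup (class and field names are my own reconstruction; the original code was not published, and the key distribution of a, b in 0..999 is inferred from the text below):

```java
import java.util.HashMap;
import java.util.Map;

public class HashBench {

    static final class PrimaryKey {
        final int a, b;
        PrimaryKey(int a, int b) { this.a = a; this.b = b; }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof PrimaryKey)) return false;
            PrimaryKey p = (PrimaryKey) o;
            return a == p.a && b == p.b;
        }

        @Override
        public int hashCode() {
            return a << 16 ^ b; // the variant under test
        }
    }

    public static void main(String[] args) {
        int n = 1000000;
        // presized so the map never rehashes during the test
        Map<PrimaryKey, Integer> map = new HashMap<PrimaryKey, Integer>(2 * n);

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++)
            map.put(new PrimaryKey(i % 1000, i / 1000), i);
        long t1 = System.nanoTime();

        int hits = 0;
        for (int i = 0; i < n; i++)
            if (map.get(new PrimaryKey(i % 1000, i / 1000)) != null) hits++;
        long t2 = System.nanoTime();

        System.out.println("put ms: " + (t1 - t0) / 1000000);
        System.out.println("get ms: " + (t2 - t1) / 1000000);
        System.out.println("hits: " + hits); // 1000000
    }
}
```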

Using this test mockup I tried several algorithms. Here is the interesting result:

public final int hashCode() {
    // return this.a + this.b; // 32876, 33578
    // return this.a ^ this.b; // 48845, 48205
    // return this.a; // 11016, 14266
    // return this.b; // 49658, 49626 (worst case)
    // return this.a * 65536 + this.b; // 1672, 2437 (optimum, less readable)
    // return this.a + this.b * 65536; // 2063, 4063
    return this.a << 16 ^ this.b; // 1672, 2453 (optimum, well readable)
    // return this.a ^ this.b << 16; // 2204, 4609
}

The left number is the time for .put(), the right for .get(). As you can see, the algorithm has a huge impact on performance. While the best implementation needs less than two seconds to write and less than three seconds to read all one million instances (which, by the way, is fascinating, since this is Java - the language everybody calls "slow"), a typically used algorithm like "return this.a ^ this.b;" needs nearly fifty (!) seconds. What a difference! Not to mention the nearly endless time that a simple (and idiotic) "return 0;" produces... (I cancelled that one after several minutes. Yes: MINUTES!)

Another interesting point is that there is no general optimum: compare the last two algorithms and you will notice that they are essentially the same. While the first one (the optimum) shifts a, the second one shifts b. The explanation is quite simple: since our numbers range from 0 to 1000, shifting a leads to a strongly increased spread, while shifting b leads to only a weak increase in spread (and the spread is what brings the performance). As the difference between these two algorithms is 100%, this shows that the better you know your data, the better the implementation can be optimized (and, as a result, there is no general optimum).
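The spread argument can be made concrete by counting how many distinct hash values each algorithm produces over the 1000 x 1000 key grid used in the test (a quick sketch, using Java 8 lambdas for brevity rather than the 2010-era Java of the original test; the counts, not the syntax, are the point):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntBinaryOperator;

public class SpreadDemo {

    // count distinct hash values over the 1000 x 1000 key grid (a, b in 0..999)
    static int distinct(IntBinaryOperator h) {
        Set<Integer> seen = new HashSet<>();
        for (int a = 0; a < 1000; a++)
            for (int b = 0; b < 1000; b++)
                seen.add(h.applyAsInt(a, b));
        return seen.size();
    }

    public static void main(String[] args) {
        // one million keys, but:
        System.out.println("a + b      : " + distinct((a, b) -> a + b));       // 1999 distinct values
        System.out.println("a ^ b      : " + distinct((a, b) -> a ^ b));       // 1024 distinct values
        System.out.println("a          : " + distinct((a, b) -> a));           // 1000 distinct values
        System.out.println("a << 16 ^ b: " + distinct((a, b) -> a << 16 ^ b)); // 1000000 (collision-free)
    }
}
```

With a ^ b, a million keys are crammed into at most 1024 hash values, so each HashMap bucket degenerates into a long collision chain; a << 16 ^ b gives every key its own value.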

Conclusion: There is no generally optimal solution, but "return this.a << 16 ^ this.b;" will be optimal in the most cases.

Another question is the performance of Strings. Often the above algorithm cannot be used, since the "PrimaryKey" class contains Strings. So what to do then? I changed my test to provide some more numbers.

The test program was changed only slightly: while the general mockup is exactly the same, the implementation of the "PrimaryKey" class was changed to internally store Strings instead of integers (and, of course, .hashCode() and .equals() were modified accordingly).
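A sketch of the String-based variant (again my own reconstruction), using the separator idea discussed below so that swapped fields do not collide:

```java
// String-based key: a and b are concatenated with a separator before hashing,
// so ("1", "11") and ("11", "1") produce different strings and different hashes.
public final class StringKey {
    private final String a, b;

    public StringKey(String a, String b) { this.a = a; this.b = b; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof StringKey)) return false;
        StringKey k = (StringKey) o;
        return a.equals(k.a) && b.equals(k.b);
    }

    @Override
    public int hashCode() {
        // without the separator, "1" + "11" and "11" + "1" would both be "111"
        return (a + '|' + b).hashCode();
    }
}
```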

Using that changed mockup I tried several String algorithms. Here is the result:

public final int hashCode() {
    // return this.a.hashCode() << 16 + this.b.hashCode(); // 36235, 35642 (worst case; beware: + binds tighter than <<)
    // return (this.a + this.b).hashCode(); // 1360, 250 (skew!)
    // return (this.b + this.a).hashCode(); // 1515, 375 (skew!)
    // return (this.a + '|' + this.b).hashCode(); // 1156, 485 (good)
    // return (this.a + "|" + this.b).hashCode(); // 1156, 485 (good)
    return (this.a + "|~" + this.b).hashCode(); // 1156, 266 (optimum?)
}

I started with the optimal solution from the integer test and was shocked to see that it is actually the worst case for Strings. The reason seems to be that a call to String.hashCode() (possibly any method of any class?) needs some fixed base time independent of the String's length, and that time seems to be quite large. Concatenating the Strings before calling .hashCode() works better, but is highly dependent on your data: since "1" + "11" produces the same result as "11" + "1", this can lead to a skewed spread and, as a result, very nondeterministic performance (you shouldn't do that in practice).

To prevent this problem, the typical solution is to insert a "rarely used character" (usually the pipe symbol) between the two parts, so that "1|11" and "11|1" get different hash codes. The test proves that the result is still good (though there is no test for the skew itself, since that was out of scope here). But here is the really funny part (which nobody has been able to explain to me): inserting two "rarely used" characters improves performance even further! I could hardly believe it, but it is true, as you can see above. Be warned, though: I think that is pure coincidence. I would not advise anyone to add a second character without a good explanation for it. So the well-known String algorithm (inserting a pipe symbol) works pretty well, but depending on the actual data you are using, it might not be the optimal one.

Conclusion: There is no generally optimal solution, but "return (this.a + '|' + this.b).hashCode();" should provide very good results in most cases.


apache-commons HashCodeBuilder

Have you tried the apache commons-lang HashCodeBuilder? It is what I typically use to build my hash codes. I would be curious to see how it performs in your tests.

These apache-commons Builder utilities are abominations

HashCodeBuilder, EqualsBuilder etc. - they require object allocation, which is a big no-no in critical methods like hashCode() and equals(). And the latest IDEs will generate these methods with MUCH superior quality, trust me (at least Eclipse does).
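For reference, an IDE-generated pair typically looks like this (a sketch of Eclipse's default output for a two-int key, which uses the prime 31 and allocates nothing):

```java
// Eclipse-style generated hashCode()/equals() for a key with two int fields.
public final class GeneratedKey {
    private final int a, b;

    public GeneratedKey(int a, int b) { this.a = a; this.b = b; }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + a; // fold in each field in turn
        result = prime * result + b;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof GeneratedKey)) return false;
        GeneratedKey other = (GeneratedKey) obj;
        return a == other.a && b == other.b;
    }
}
```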

I find it very interesting that some people just say things ...

I find it very interesting that some people just say things like 'big no-no' and 'MUCH superior quality' without facts.

I did run a set of tests in different JVMs (1.5, 1.6, 1.7 and IBM 1.5).

The tests were very similar to the original post, I tested with 2 million objects, Eclipse (Indigo) generated vs Apache HashCodeBuilder and also Java core HashMap vs. FastMap (from Javolution).

If you use the Java HashMap, the Eclipse-generated code performs better, roughly 50% faster on all JVMs (except for IBM, on which the Apache implementation is always faster).

If you use FastMap (and why wouldn't you, if you have heavy processes using Maps?), then the Apache HashCodeBuilder is almost twice as fast, and the readability is way better.

People that are truly concerned about quality and performance run tests, people that pretend to be concerned say things like 'abomination' and 'trust me'.

Temporary Allocation is no problem with modern JVMs

Our tests did not show any problem with short term object allocation on modern JVMs. Or in other words: It is a Java myth that allocation in loops is a problem in the real world. At least for most applications.

Good question

Regardless of your results, you bring up a very interesting (and fundamental) question. I hope to see more such posts in the future.

More Interesting Stuff

Stay tuned. We do real-world tests like this (i.e. not an academic discussion of what should be a good solution, but of what proved to be fast) every other week. We noticed that there is a big difference between what people (even from Sun and Oracle) write about how they think things work and what our customers report about how things really are. One idea of this blog is to tell the truth from our point of view, which might differ from the truth from others' points of view.

Not measuring hashCode(), but rather HashMap

At least in your first test, the execution of hashCode() is virtually irrelevant - it will consume a tiny amount of CPU cycles for all algorithms. The performance differences that you observe should be attributed to the innards of HashMap. Hash codes like a^b produce a large number of collisions, because (1) the hash code's most significant bit is not higher than the max msb of a and b, a big problem if your test's PKs are not random samples of the range 0..2^32-1; (2) a^b == b^a, which cuts your hash code variability in half. HashMap's performance will quickly degenerate from O(1) to O(N) for objects with poor hash codes (lots of collisions). The a<<16+b hash code is better, mostly for small values of a/b: you guarantee at least 16 significant bits and no collisions at all for values of a/b below 2^16 (for bigger values, the hash code quality approaches that of a^b).

As for the String test: please, don't EVER allocate objects (in this case, String and javac-generated StringBuilder objects) in a method like hashCode(). The allocation & GC costs may disappear in a synthetic benchmark that does only this in a tight loop (i.e. you basically fill the young-gen with these temp objects, and generational GC is virtually instantaneous because the young-gen is stuffed with dead objects and GC time depends only on the number of live objects), but things may be very different in a real-world app workload.

Theory and Practice

In theory you are absolutely right. But in practice, our customers reported that real-world application performance actually was better after implementing our "optimal" solution. What does that mean? That there is a gap between what we all think should happen in real-world applications and what actually does happen in the wild. I am not arguing against your points; they are all true. We just wanted to report that, for whatever reason, in the real world there have been no reports of negative effects, but there have been reports of positive effects. So, for whatever reason, our benchmark actually predicted the result we got in the end.

Your short test runs short of performance tips...

Agreed with the previous post. Most alternatives you tried are bad practices. For example, for your PrimaryKey class with two Integers or Strings a and b, you can try the pattern: p * (a != null ? a.hashCode() : 0) + (b != null ? b.hashCode() : 0); pick a prime p. Most IDEs will generate this for you.
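The null-safe prime pattern from this comment, written out as a complete key class (a sketch, with p = 31 as the usual IDE choice; the field type could just as well be Integer):

```java
// Null-safe prime-multiplier hashCode(): no object allocation, handles null fields.
public final class PrimeKey {
    private final String a, b;

    public PrimeKey(String a, String b) { this.a = a; this.b = b; }

    @Override
    public int hashCode() {
        final int p = 31; // a small prime; IDEs typically pick 31
        return p * (a != null ? a.hashCode() : 0)
                 + (b != null ? b.hashCode() : 0);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PrimeKey)) return false;
        PrimeKey k = (PrimeKey) o;
        return (a == null ? k.a == null : a.equals(k.a))
            && (b == null ? k.b == null : b.equals(k.b));
    }
}
```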