
Java SE 5.0 Update 8's cool performance fixes

Posted by opinali on August 11, 2006 at 1:17 PM PDT

Java SE 5.0 Update 8 is available. Besides the traditional few dozen bugfixes, this release closes a number of important performance bugs in the JIT compiler, GC, and runtime. So even if you play on the conservative side, it's a great time to evaluate and upgrade your Java runtime.

One of the fixed perf bugs is Bug 4803284: Bad performance when HotSpot cannot optimize polymorphic calls, which I reported in January 2003 against JDK 1.4.1. That bug received a partial fix in Tiger-b28, when Sun promised the full solution "in the next major release" (Mustang), splitting the remaining work into Bug 5074577. There's also a dupe, Bug 4898726: Nonstatic method calls take a lot more time than static method calls. Well, it seems Sun kept the promise, and these bugs are all fully solved now!

I obviously had to re-run my MethodBench (included in the bug report) to see if the new fix made any difference (the partial fix had already made me happy). And it does, though only for HotSpot Server:

 

JDK 5.0 Update 7:

D:\Java\Bench\Method>java -server -Xbatch MethodHard
*** Testing (Hard) ***
Interf  TPoly   Poly    Virtual Final   Static  Native
35,530  25,682  01,155  01,155  01,146  01,138  259,300

JDK 5.0 Update 8:

D:\Java\Bench\Method>java -server -Xbatch MethodHard
*** Testing (Hard) ***
Interf  TPoly   Poly    Virtual Final   Static  Native
14,754  23,599  01,134  01,130  01,134  01,134  254,700

JDK 6.0 build 94:

D:\Java\Bench\Method>java -server -Xbatch MethodHard
*** Testing (Hard) ***
Interf  TPoly   Poly    Virtual Final   Static  Native
13,711  25,318  01,142  01,146  01,146  01,155  67,980

 

(I skipped the Reflection test: it's very slow in the "Hard" case, and I'm in a hurry now.)

The improvement is in the Interf test (interface calls), where both 5.0u8 and Mustang beat u7 at roughly 14ns vs. 35ns per call. The other tests show only marginal improvements. Well, in a world full of complex frameworks abusing interfaces, this optimization may be significant. And this is for a "hard" test where the compiler cannot infer a single concrete type for the receiver, so easy devirtualization is impossible. See the relevant benchmark code:

 

public void bench ()
{
    int calls = operations / 1000;
    // Note: HelperB and HelperD are unrelated classes (neither inherits from the other),
    // but both implement the polyHardF() method defined by HelperInterface.
    HelperInterface helper = null, h1 = new HelperB(), h2 = new HelperD();

    for (int i = 0; i < 1000; ++i)
    {
        switch (i / 500)
        {
            // Here, I periodically reassign the helper variable to either HelperB or HelperD,
            // making the call-site below really polymorphic, which kills easy dataflow-based
            // devirtualization.  Notice also that 50% of all calls will go to each class,
            // so profiling won't let the JIT optimize one case (e.g. HelperB) to the
            // detriment of the other with a type-check + inlining only for the hot case.
            case 0: helper = h1; break;
            case 1: helper = h2; break;
        }

        for (int j = 0; j < calls; ++j)
            // This is the polymorphic, interface-based call-site to optimize.
            // The invoked methods perform a trivial operation, but one that
            // cannot be optimized away (it updates a static variable).
            helper.polyHardF(i);
    }
}
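For context, the helper types used above look roughly like this (a sketch reconstructed from the comments in the benchmark and in the assembly below; the actual sources in the bug report may differ in detail):

interface HelperInterface
{
    void polyHardF (int i);
}

// Shared superclass holding the static counter that polyHardF() updates.
class Helper
{
    static int x;
}

// Two implementations that don't inherit from each other, so the call-site in
// bench() really sees two possible concrete receiver types.
class HelperB extends Helper implements HelperInterface
{
    public void polyHardF (int i) { ++x; }
}

class HelperD extends Helper implements HelperInterface
{
    public void polyHardF (int i) { ++x; }
}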

 

From the bug-report comments, it seems that this fix is not specific to interface calls; it's only an artifact of my benchmark that the optimizer's limitation was caught only in the MethodHard/Interf test. But since interface-based calls are by far the hardest ones to optimize (due to multiple inheritance), this is good news for complex OO code overall.

How does HotSpot compile this code in 5.0u8 and Mustang? I looked at the generated code (using Mustang's debug build and -XX:+PrintOptoAssembly) and saw that it actually inlines both HelperB's and HelperD's implementations of polyHardF(). It's a strong demonstration of several optimizations combined: the compiler must determine that there are only two possible concrete types at the call-site, and that both deserve inlining... and there's more - let's check the code:
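For reference, a dump like the one below can be produced with roughly the following invocation (the exact command may have differed, and -XX:+PrintOptoAssembly is only available in debug builds, hence Mustang's debug build):

D:\Java\Bench\Method>java -server -Xbatch -XX:+PrintOptoAssembly MethodHard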

 

# Start of the inner loop: int j = 0...

0d8   B12: #    B13 <- B11  Freq: 15356
0d8     XOR    EBP,EBP
0da

# ...Loop body...

0da   B13: #    B29 B14 <- B12 B18      Loop: B13-B18 stride: not constant  Freq: 30715
0da     MOV    ECX,[ESP + #12]
0de     MOV    EBX,[ECX + #4]
0e1     NullCheck ECX
0e1

# The "payload" code for the inlined methods is here, because both
# implementations of testHardF() are identical = { ++x; } where x is a static int
# (and it's the same x, in a shared superclass).  HotSpot was smart enough to
# merge part of the common code (the read and the increment) and move it up.

0e1   B14: #    B16 B15 <- B13  Freq: 30715
0e1     MOV    ECX,#344
0e6     MOV    EDI,[ECX + precise klass Helper: 0x00a3e2a8:Constant:exact *]
0ec     ADD    EDI,[ESP + #16]

# Guard code: helper instanceof HelperB?

0f0     CMPu   EBX,precise klass HelperB: 0x00a3c3d0:Constant:exact *
0f6     Jne,us B16  P=0.266667 C=-1.000000
0f6

# Inlined code for (the rest of) HelperB.polyHardF().

# I don't completely understand the need for this code, but I guess HotSpot
# wants to keep track of stack frames, and/or produce a trap if some change
# in the class hierarchy invalidates the inlining.

0f8   B15: #    B18 <- B14  Freq: 22524.3
0f8     MOV    ECX,[ESP + #12]
0fc     #checkcastPP of ECX
0fc     MOV    [ESP + #12],ECX
100     MOV    EBX,#344

# Here the new value of x is stored -- the read and increment were moved up
# for sharing, but the write must be performed in the right place (I think)
# to avoid illegal behavior: e.g., if helper is neither HelperB nor HelperD
# (quite possible if it is null; the benchmark code makes it difficult to prove
# the never-null property of helper), then a NullPointerException should be
# raised and x should not be updated by out-of-order code.

105     MOV    [EBX + precise klass Helper: 0x00a3e2a8:Constant:exact *],EDI
10b     JMP,s  B18

# Guard code: helper instanceof HelperD?

10b
10d   B16: #    B26 B17 <- B14  Freq: 8190.66
10d     CMPu   EBX,precise klass HelperD: 0x00a3c438:Constant:exact *
113     Jne,us B26  P=0.000001 C=-1.000000
113

# Inlined code for (the rest of) HelperD.polyHardF().
# Same stuff as before.  It's sad that, with both methods being exactly identical,
# HotSpot inlines and merges only part of their bodies, but does so neither for
# this bookkeeping code nor for the store.  Quite likely HS avoids that so it
# doesn't lose the ability to walk stacks precisely, in events like GC, OSR, and
# exception stack-trace filling (don't you hate JVMs that don't report exact
# line numbers for all frames in compiled code?).

115   B17: #    B18 <- B16  Freq: 8190.65
115     MOV    ECX,[ESP + #12]
119     #checkcastPP of ECX
119     MOV    [ESP + #12],ECX
11d     MOV    EBX,#344
122     MOV    [EBX + precise klass Helper: 0x00a3e2a8:Constant:exact *],EDI
122

# ...End of the inner loop: ++j; go back if j < calls

128   B18: #    B13 B19 <- B15 B17 B27  Freq: 30715
128     INC    EBP
129     CMP    EBP,[ESP + #0]
12c     Jlt,s  B13  P=0.999935 C=15360.000000
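
Translated back to Java, the compiled inner loop behaves roughly like the method below (a sketch of the guarded two-way inlining, using the hypothetical helper sources above; it is not a literal decompilation of the assembly):

static void innerLoopAsCompiled (HelperInterface helper, int calls, int i)
{
    for (int j = 0; j < calls; ++j)
    {
        int tmp = Helper.x + 1;                       // shared payload (read + increment), hoisted above the guards
        if (helper.getClass() == HelperB.class)       // guard: exact-type check ("precise klass HelperB")
            Helper.x = tmp;                           // inlined store for HelperB.polyHardF()
        else if (helper.getClass() == HelperD.class)  // guard: exact-type check ("precise klass HelperD")
            Helper.x = tmp;                           // inlined store for HelperD.polyHardF()
        else
            helper.polyHardF (i);                     // uncommon path: a real interface call (or a deoptimization trap)
    }
}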

 

That's it. But just so you don't think the optimizer is now perfect: it could do even better if HS could combine the switch statement with versioning of the inner loop, so that the overhead of testing helper's concrete type and branching to the correct inlined code would be reduced to 1/500th, like this:

switch (i / 500)
{
    case 0:
        helper = h1;
        for (int j = 0; j < calls; ++j)
            helper.polyHardF(i);
        break;
    case 1:
        helper = h2;
        for (int j = 0; j < calls; ++j)
            helper.polyHardF(i);
}

 

This transformation would even dispense with the complex optimizations we just saw, because in the new code there are two call-sites that are trivially identified as non-polymorphic, so the compiler is free to inline with zero overhead, and also to do other tricks like replacing the entire inner loop with "x += 500" (sketched below), thus delivering a >500X boost (in which case my microbenchmark would be declared "broken", since such a code pattern hardly occurs in real app code). But the above optimization is really, really difficult; there is no optimizer (for Java at least) that can do it. Which allows me to conclude with two of my favorite clichés:
a) There's room for the next major release to be even faster.
b) Human Assembly hackers (with enough time on their hands) will always win.
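
To make the "x += 500" remark concrete, here is the hypothetical end result for each versioned loop (reusing the sketched helper names above; no current Java optimizer does this):

// Each versioned inner loop "for (int j = 0; j < calls; ++j) helper.polyHardF(i);"
// could, with its single target inlined down to ++x, be folded into one add:
Helper.x += calls;   // i.e., "x += 500" when calls happens to be 500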

Enjoy the new JDK's performance. I haven't yet tried it on real-world application benchmarks, but there are other important perf bugfixes, like Bug 6298694: bad performance with big object in heap, which is probably much more important than my "pet bug" discussed here, because big objects in the heap can mess up concurrent GC performance.
