|
|
||
Osvaldo Pinali Doederlein's BlogPerformance ArchivesJava SE 5.0 Update 8's cool performance fixesPosted by opinali on August 11, 2006 at 01:17 PM | Permalink | Comments (0)Java SE 5.0 Update 8 is available. Besides the traditional few dozens of bugfixes, this release closes a number of important performance bugs in the JIT compiler, GC and runtime. So even if you play more in the conservative side, it's a great time to evaluate and upgrade your Java runtime. One of the fixed perf bugs is Bug 4803284: Bad performance when HotSpot cannot optimize polymorphic calls, which I reported in January 2003 against JDK 1.4.1. That bug received a partial fix with Tiger-b28, when Sun promised the full solution "in next major release" (Mustang), splitting this to Bug 5074577. There's also a dupe, Bug 4898726: Nonstatic method calls take lot more time than static method calls. Well, it seems Sun kept the promise and these bugs are all fully solved now! I obviously had to re-run my MethodBench (included in the bug report) to see if the new fix did any difference (as the partial fix already made me happy). And it does, only for HotSpot Server:
JDK 5.0 Update 7: D:\Java\Bench\Method>java -server -Xbatch MethodHard *** Testing (Hard) *** Interf TPoly Poly Virtual Final Static Native 35,530 25,682 01,155 01,155 01,146 01,138 259,300 JDK 5.0 Update 8: D:\Java\Bench\Method>java -server -Xbatch MethodHard *** Testing (Hard) *** Interf TPoly Poly Virtual Final Static Native 14,754 23,599 01,134 01,130 01,134 01,134 254,700 JDK 6.0 build 94: D:\Java\Bench\Method>java -server -Xbatch MethodHard *** Testing (Hard) *** Interf TPoly Poly Virtual Final Static Native 13,711 25,318 01,142 01,146 01,146 01,155 67,980 (The Reflection test is very slow in the "Hard" case, and I'm in a hurry now.) The improvement is in the Interf test (interface calls), where both 5.0u8 and Mustang beat u7 in 14ns vs. 35ns per call. Other tests show only marginal improvements. Well, in a world full of complex frameworks abusing of interfaces, this optimization may be significant. And this is for a "hard" test where the compiler cannot infer a single concrete type (devirtualization). See the relevant benchmark code:
public void bench ()
{
int calls = operations / 1000;
// Note: HelperB and HelperD are unrelated classes (don't inherit each other),
// but both implement the polyHardF() method defined by HelperInterface.
HelperInterface helper = null, h1 = new HelperB(), h2 = new HelperD();
for (int i = 0; i < 1000; ++i)
{
switch (i / 500)
{
// Here, I periodically reassign the helper variable to either HelperB or HelperD,
// making the call-site below really polymorphic, which kills easy dataflow-based
// dervirtualization. Notice also that 50% of all calls will go to each class,
// so profiling won't let the JIT decide to optimize one case (e.g. helperB) in
// detriment of the other with a type-check + inlining only for the hot case.
case 0: helper = h1; break;
case 1: helper = h2; break;
}
for (int j = 0; j < calls; ++j)
// This is the polymorphic, interface-based, call-site to optimize.
// The invoked methods perform a trivial operation, but one that
// cannot be optimized out (it updates a static variable).
helper.polyHardF(i);
}
}
From the bugtrack comments, it seems that this fix is not specific to interface calls. It's only an artifact of my benchmark that the limitation of the optimizer was only caught in the MethodHard/Interf test. But since interface-based calls are by far the hardest ones to optimize (due to multiple inheritance), this is good news for complex OO code overall. How does HotSpot compile this code, in 5.0u8 and Mustang? I looked at the code (using Mustang's debug build and -XX:+PrintOptoAssembly) and saw that it actually inlines both HelperB's and HelperD's implementations of polyHardF(). It's a strong demonstration of several optimizations combined: the compiler must determine that there are only two possible concrete types for the call-site, and that both deserve inlining... and there's more - let's check the code:
# Start of the inner loop: int j = 0...
0d8 B12: # B13 <- B11 Freq: 15356
0d8 XOR EBP,EBP
0da
# ...Loop body...
0da B13: # B29 B14 <- B12 B18 Loop: B13-B18 stride: not constant Freq: 30715
0da MOV ECX,[ESP + #12]
0de MOV EBX,[ECX + #4]
0e1 NullCheck ECX
0e1
# The "payload" code for the inlined methods is here, because both
# implementations of testHardF() are identical = { ++x; } where x is a static int
# (and it's the same x, in a shared superclass). HotSpot was smart enough to
# merge part of the common code (the read and the increment) and move it up.
0e1 B14: # B16 B15 <- B13 Freq: 30715
0e1 MOV ECX,#344
0e6 MOV EDI,[ECX + precise klass Helper: 0x00a3e2a8:Constant:exact *]
0ec ADD EDI,[ESP + #16]
# Guard code: helper instanceof HelperB?
0f0 CMPu EBX,precise klass HelperB: 0x00a3c3d0:Constant:exact *
0f6 Jne,us B16 P=0.266667 C=-1.000000
0f6
# Inlined code for (the rest of) HelperB.testHardF().
# I don't understand completely the need for this code, but I guess HotSpot
# wants to keep track of stack frames, and/or produce a trap if there is some
# change in the class hierarchy that invalidates the inlining.
0f8 B15: # B18 <- B14 Freq: 22524.3
0f8 MOV ECX,[ESP + #12]
0fc #checkcastPP of ECX
0fc MOV [ESP + #12],ECX
100 MOV EBX,#344
# Here the new value for x is stored -- the read and increment were moved up
# for sharing, but the write must be performed in the right place (I think)
# to not produce illegal behaviors: e.g., if helper is neither HelperB nor HelperD,
# quite possible if it is null (the benchmark code makes difficult to prove the
# never-null property of helper), then a NullPointerException should be raised
# and x should not be updated due to out-of-order code.
105 MOV [EBX + precise klass Helper: 0x00a3e2a8:Constant:exact *],EDI
10b JMP,s B18
# Guard code: helper instanceof HelperD?
10b
10d B16: # B26 B17 <- B14 Freq: 8190.66
10d CMPu EBX,precise klass HelperD: 0x00a3c438:Constant:exact *
113 Jne,us B26 P=0.000001 C=-1.000000
113
# Inlined code for (the rest of) HelperD.testHardF().
# Same stuff as before. It's sad that, both methods being exactly identical,
# HotSpot inlines and merges only part of their bodies, but doesn't do it
# neither for this bookkeeping code nor for the store. Quite likely HS doesn't
# want that, so it doesn't lose the ability to walk stacks precisely, in events
# like GC, OSR, and exception stack-trace filling (don't you hate JVMs that don't
# report exact line numbers for all frames, in compiled code?).
115 B17: # B18 <- B16 Freq: 8190.65
115 MOV ECX,[ESP + #12]
119 #checkcastPP of ECX
119 MOV [ESP + #12],ECX
11d MOV EBX,#344
122 MOV [EBX + precise klass Helper: 0x00a3e2a8:Constant:exact *],EDI
122
# ...End of the inner loop: ++j; go back if j < calls
128 B18: # B13 B19 <- B15 B17 B27 Freq: 30715
128 INC EBP
129 CMP EBP,[ESP + #0]
12c Jlt,s B13 P=0.999935 C=15360.000000
That's it. But just so you don't think the optimizer is now perfect, it could be even better, if HS could combine the switch statement with versioning of the inner loop, so the overhead of testing helper's concrete type and branching to the correct inlined code would be reduced to 1/500th, like this:
switch (i / 500)
{
case 0:
helper = h1;
for (int j = 0; j < calls; ++j)
helper.polyHardF(i);
break;
case 1:
helper = h2;
for (int j = 0; j < calls; ++j)
helper.polyHardF(i);
}
This transformation would even dispense the complex optimizations we just saw, because in the new code, there are two call-sites that are trivially identified as non-polymorphic, so the compiler is free to inline with zero overhead, and also do other tricks like replacing the entire loop by "x += 500", thus delivering a >500X boost (in this case, my microbenchmark would be declared "broken" as such code pattern hardly occurs in real app code). But the above optimization is really, really difficult; there is no optimizer (for Java at least) that can do it. Which allows me to conclude with two of my favorite clichés:
Enjoy the new JDK's performance. I haven't yet tried it on real-world application benchmarks, but there are other important perf bugfixes, like Bug 6298694: bad performance with big object in heap, probably much more important than my "pet bug" discussed here, because the latter can screw up with concurrent GC performance. | ||
|
|