Deep dive into assembly code from Java
Posted by kohsuke on March 30, 2008 at 10:10 PM
One of the things I learned at TheServerSide Java Symposium 2008 was a command-line option that prints out the assembly code the JIT is producing. Since I've always been interested in seeing the final assembly code that gets produced from Java code, I decided to give it a test drive.
First the disclaimers:
- I'm not a performance expert.
- Don't try to take this too far, like optimizing your code against what you see here.
The option in question is only available in debug builds of JDKs. You can download one from here. The binary I tested is JDK6 u10 b14.
$ java -fullversion
java full version "1.6.0_10-beta-fastdebug-b14"
First, let's try something trivial:
public class Main {
    public static void main(String[] args) {
        for(int i=0; i<100; i++)
            foo();
    }
    private static void foo() {
        for(int i=0; i<100; i++)
            bar();
    }
    private static void bar() {
    }
}
I run this like "java -XX:+PrintOptoAssembly -server -cp . Main". The -XX:+PrintOptoAssembly is the magic option, and with this option I get the following, which shows the code of the "foo" method:
000 B1: # N1 <- BLOCK HEAD IS JUNK Freq: 100
000 pushq rbp
subq rsp, #16 # Create frame
nop # nop for patch_verified_entry
006 addq rsp, 16 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
011 ret
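You see that the entire bar() call and the loop around it were optimized away: all that remains is the frame setup, the frame teardown, and the safepoint poll. So it must have inlined the bar() method, then removed the loop once its body was empty.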
Now to something more interesting:
private static byte[] foo() {
    byte[] buf = new byte[256];
    for( int i=0; i<buf.length; i++ )
        buf[i] = 0;
    return buf;
}
This produces the following code:
000 B1: # B15 B2 <- BLOCK HEAD IS JUNK Freq: 78
000 # stack bang
pushq rbp
subq rsp, #80 # Create frame
00c # TLS is in R15
00c movq R8, [R15 + #120 (8-bit)] # ptr
010 movq R10, R8 # spill
013 addq R10, #280 # ptr
01a cmpq R10, [R15 + #136 (32-bit)] # raw ptr
021 jge,u B15 P=0.000100 C=-1.000000
021
027 B2: # B3 <- B1 Freq: 77.9922
027 movq [R15 + #120 (8-bit)], R10 # ptr
02b PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
033 movq [R8], 0x0000000000000001 # ptr
03a PREFETCHNTA [R10 + #320 (32-bit)] # Prefetch to non-temporal cache for write
042 movq RDI, R8 # spill
045 addq RDI, #24 # ptr
049 PREFETCHNTA [R10 + #384 (32-bit)] # Prefetch to non-temporal cache for write
051 movl RCX, #32 # long (unsigned 32-bit)
056 movq R10, precise klass [B: 0x00002aaaab076708:Constant:exact * # ptr
060 movq [R8 + #8 (8-bit)], R10 # ptr
064 movl [R8 + #16 (8-bit)], #256 # int
06c xorl rax, rax # ClearArray:
rep stosq # Store rax to *rdi++ while rcx--
071
071 B3: # B4 <- B16 B2 Freq: 78
071
071 # checkcastPP of R8
071 xorl R10, R10 # int
074 movl R9, #256 # int
nop # 2 bytes pad for loops and calls
07c B4: # B17 B5 <- B3 B5 Loop: B4-B5 inner stride: not constant pre of N153 Freq: 19850.2
07c cmpl R10, #256 # unsigned
083 jge,u B17 P=0.000001 C=-1.000000
083
089 B5: # B4 B6 <- B4 Freq: 19850.2
089 movslq R11, R10 # i2l
08c movb [R8 + #24 + R11], #0 # byte
092 incl R10 # int
095 cmpl R10, #8
099 jlt,s B4 P=0.996072 C=22313.000000
099
09b B6: # B11 B7 <- B5 Freq: 77.9799
09b subl R9, R10 # int
09e andl R9, #-16 # int
0a2 addl R9, R10 # int
0a5 cmpl R10, R9
0a8 jge,s B11 P=0.500000 C=-1.000000
0a8
0aa B7: # B8 <- B6 Freq: 38.9899
0aa PXOR XMM0,XMM0 ! replicate8B
nop # 2 bytes pad for loops and calls
0b0 B8: # B10 B9 <- B7 B9 Loop: B8-B9 inner stride: not constant main of N85 Freq: 9925.09
0b0 movslq R11, R10 # i2l
0b3 MOVQ [R8 + #24 + R11],XMM0 ! packed8B
0ba movl R11, R10 # spill
0bd addl R11, #16 # int
0c1 movslq R10, R10 # i2l
0c4 MOVQ [R8 + #32 + R10],XMM0 ! packed8B
0cb cmpl R11, R9
0ce jge,s B10 P=0.003928 C=22313.000000
0ce
0d0 B9: # B8 <- B8 Freq: 9886.1
0d0 movl R10, R11 # spill
0d3 jmp,s B8
0d3
0d5 B10: # B11 <- B8 Freq: 38.9899
0d5 movl R10, R11 # spill
0d5
0d8 B11: # B14 B12 <- B6 B10 Freq: 77.9799
0d8 cmpl R10, #256
0df jge,s B14 P=0.500000 C=-1.000000
nop # 3 bytes pad for loops and calls
0e4 B12: # B17 B13 <- B11 B13 Loop: B12-B13 inner stride: not constant post of N153 Freq: 9922.54
0e4 cmpl R10, #256 # unsigned
0eb jge,us B17 P=0.000001 C=-1.000000
0eb
0ed B13: # B12 B14 <- B12 Freq: 9922.53
0ed movslq R11, R10 # i2l
0f0 movb [R8 + #24 + R11], #0 # byte
0f6 incl R10 # int
0f9 cmpl R10, #256
100 jlt,s B12 P=0.996072 C=22313.000000
100
102 B14: # N1 <- B13 B11 Freq: 77.9698
102 movq RAX, R8 # spill
105 addq rsp, 80 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
110 ret
110
111 B15: # B18 B16 <- B1 Freq: 0.00780129
111 movq RSI, precise klass [B: 0x00002aaaab076708:Constant:exact * # ptr
11b movl RDX, #256 # int
120 nop # 3 bytes pad for loops and calls
123 call,static wrapper for: _new_array_Java
# Main::foo @ bci:3 L[0]=_ L[1]=_
#
128
128 B16: # B3 <- B15 Freq: 0.00780114
# Block is sole successor of call
128 movq R8, RAX # spill
12b jmp B3
12b
130 B17: # N1 <- B12 B4 Freq: 1e-06
130 movl RSI, #-28 # int
135 movq RBP, R8 # spill
138 movl [rsp + #0], R10 # spill
13c nop # 3 bytes pad for loops and calls
13f call,static wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant')
# Main::foo @ bci:17 L[0]=RBP L[1]=rsp + #0 STK[0]=RBP STK[1]=rsp + #0 STK[2]=#0
# AllocatedObj(0x0000000040c31880)
144 int3 # ShouldNotReachHere
144
151 B18: # N1 <- B15 Freq: 7.80129e-08
151 # exception oop is in rax; no code emitted
151 movq RSI, RAX # spill
154 addq rsp, 80 # Destroy frame
popq rbp
159 jmp rethrow_stub
Just to recap, R8-R15 are the additional general-purpose 64-bit registers new in amd64.
The first part (00c-027) is allocating an array, and this is already interesting. As the comment indicates, R15 is apparently used as a pointer to a thread-local storage of the current thread, and R15[120] is the pointer to the head of the heap sub-space dedicated for this thread.
So the byte[] is allocated from this thread-local space by simply reserving 280 bytes: 256 for the data plus a 24-byte header (see the addq R10, #280 at 013). If there's not enough space (the limit is at R15[136]), it falls back to the slower allocation code at B15. That code presumably reserves a new chunk from the eden space and allocates the object there.
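To make the fast path concrete, here is a minimal sketch of bump-pointer allocation in plain Java. It is only a model of what the listing appears to do: the class name TlabSketch and the fields top and end are made up for illustration and are not HotSpot's own names, and returning -1 stands in for the jump to the slow path at B15.

```java
// Hypothetical model of the thread-local bump-pointer allocation seen in the listing.
// The names here are illustrative only; they are not taken from HotSpot.
class TlabSketch {
    private long top;        // next free address, playing the role of [R15 + 120]
    private final long end;  // limit of the thread-local space, playing the role of [R15 + 136]

    TlabSketch(long start, long size) {
        this.top = start;
        this.end = start + size;
    }

    /** Returns the address of the new object, or -1 to signal "take the slow path". */
    long allocate(long bytes) {
        long newTop = top + bytes;   // addq R10, #280
        if (newTop >= end) {         // cmpq R10, [R15 + 136] followed by jge,u B15
            return -1;               // the compiled code calls _new_array_Java here
        }
        long object = top;
        top = newTop;                // movq [R15 + 120], R10
        return object;
    }

    public static void main(String[] args) {
        TlabSketch tlab = new TlabSketch(0x1000, 1024);
        long size = 24 + 256;        // header plus data for new byte[256]
        for (int n = 0; n < 4; n++) {
            System.out.println("allocation " + n + " -> " + tlab.allocate(size));
        }
    }
}
```

With a 1024-byte space, the first three 280-byte requests succeed and the fourth returns -1, which is the point where the real code would go and get more memory.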
Once the pointer to the new array is set to R8 at 00c, the initialization follows (033-071). The first 24 bytes of the newly allocated space are used for metadata: the first 8 bytes are probably lock or GC-related, followed by a pointer to the class object, then another 8 bytes holding the size of the array (written as a 32-bit int by the movl at 064). 06c zero-clears the array. In theory the zero-clear shouldn't have been necessary, as we then fill the array with zeros again, but the JIT failed to take advantage of that.
But note that the zero-clear is done 8 bytes at a time (RCX is set to 32, and 32 x 8 = 256), so it did recognize that the array size is a multiple of 8.
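The numbers in the listing can be checked with a little arithmetic. The sketch below assumes the layout guessed at above (a 24-byte header in front of the byte data) and additionally assumes that data sizes are rounded up to a multiple of 8, which the listing does not show because 256 is already a multiple of 8. The class and method names are hypothetical.

```java
// Arithmetic behind the constants 280 and 32 in the listing.
// HEADER_BYTES reflects the layout guessed at in this post; rounding to 8 is an assumption.
class ArrayLayoutSketch {
    static final int HEADER_BYTES = 24;

    /** Rounds n up to the next multiple of 8. */
    static int align8(int n) {
        return (n + 7) & ~7;
    }

    /** Total bytes reserved for a byte[] of the given length under these assumptions. */
    static int allocationSize(int length) {
        return HEADER_BYTES + align8(length);
    }

    /** Number of 8-byte words that "rep stosq" has to clear (the value loaded into RCX). */
    static int clearWords(int length) {
        return align8(length) / 8;
    }

    public static void main(String[] args) {
        System.out.println(allocationSize(256)); // 280, matching addq R10, #280
        System.out.println(clearWords(256));     // 32, matching movl RCX, #32
    }
}
```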
I don't quite understand what those prefetch instructions (at 02b, 03a, and 049) are meant for. Presumably they are to make sure that the next time an object allocation happens, that part of the memory is in cache, but why 256, 320, and 384? Does anyone have a clue?
Now as of 074, R8 is the pointer to 'buf' and R9 is the length of the array. Note that JIT knows that buf.length is always 256 here, so this is movl R9,256 and not movl R9,[R8+16]. Also note that this computation is outside the for loop. So this tells us that there's no need to explicitly assign the array length to a temporary variable in a tight loop, because JIT does the equivalent anyway:
int len = buf.length;
for( int i=0; i<len; i++ )
    buf[i] = 0;
Similarly there's no need to reverse the direction of the loop to avoid buf.length computation.
The way the loop is compiled is very interesting. First there's the 'warm up' part (07c-099) that fills the array one byte at a time until the index reaches 8, presumably for alignment. Then comes the 'fast loop' portion (09b-0d3) that zero-fills with 8-byte stores from an SSE register (XMM0), two stores and therefore 16 bytes per iteration. Finally the 'cool down' part (0d5-100) handles the remaining bytes that don't make up a full 16-byte step. In this case, in theory it could have figured out that the whole array divides evenly, so the warm up and cool down were unnecessary, but it appears that the JIT didn't realize this.
I don't know what kind of computation happens behind the scenes here, but overall this loop transformation is rather clever. The original code was a byte-by-byte assignment of 0, but in the final code each store clears 8 bytes at a time.
I also noticed that there's no array boundary check in the fast loop portion, which is nice.
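The three-part shape is easier to see when written back as Java. This is a hand-written approximation of what the listing appears to do, not something the JIT emits or a HotSpot API: the class name LoopShapeSketch is invented, and ByteBuffer.putLong stands in for the 8-byte MOVQ stores. The method returns how many iterations each part ran so that the split can be inspected.

```java
import java.nio.ByteBuffer;

// Hand-written approximation of the warm up / fast loop / cool down structure.
class LoopShapeSketch {
    /** Zero-fills buf and returns {warm-up iterations, fast iterations, cool-down iterations}. */
    static int[] fill(byte[] buf) {
        ByteBuffer wide = ByteBuffer.wrap(buf);
        int i = 0;
        int pre = 0, main = 0, post = 0;

        // Warm up: one byte at a time until the index reaches 8 (cmpl R10, #8).
        while (i < 8 && i < buf.length) {
            buf[i] = 0;
            i++;
            pre++;
        }

        // Same computation as subl / andl #-16 / addl on R9 in block B6.
        int limit = ((buf.length - i) & -16) + i;

        // Fast loop: two 8-byte stores per iteration, so the index advances by 16.
        while (i < limit) {
            wide.putLong(i, 0L);
            wide.putLong(i + 8, 0L);
            i += 16;
            main++;
        }

        // Cool down: whatever is left, one byte at a time.
        while (i < buf.length) {
            buf[i] = 0;
            i++;
            post++;
        }
        return new int[] { pre, main, post };
    }

    public static void main(String[] args) {
        int[] counts = fill(new byte[256]);
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```

For a 256-byte array this gives 8 warm-up iterations, 15 fast iterations covering 240 bytes, and 8 cool-down iterations, which is why the byte-wise parts still run even though 256 is a nice round number.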
OK, most of you have hopefully heard that in JDK6 they do lock coarsening and lock elision. So let's see that in action.
For that, I compiled the following code and executed in the same fashion:
private static void foo() {
    Vector v = new Vector();
    v.add("abc");
    v.add("def");
    v.add("ghi");
}
This gives me the following:
000 B1: # B10 B2 <- BLOCK HEAD IS JUNK Freq: 20168
000 # stack bang
pushq rbp
subq rsp, #80 # Create frame
00c # TLS is in R15
00c movq RAX, [R15 + #120 (8-bit)] # ptr
010 movq R10, RAX # spill
013 addq R10, #40 # ptr
017 # TLS is in R15
017 cmpq R10, [R15 + #136 (32-bit)] # raw ptr
01e jge,u B10 P=0.000100 C=-1.000000
01e
024 B2: # B3 <- B1 Freq: 20166
024 # TLS is in R15
024 movq [R15 + #120 (8-bit)], R10 # ptr
028 PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
030 movq R10, precise klass java/util/Vector: 0x00002aaaf2649f38:Constant:exact * # ptr
03a movq R11, [R10 + #176 (32-bit)] # ptr
041 movq [RAX], R11 # ptr
044 movq [RAX + #8 (8-bit)], R10 # ptr
048 movq [RAX + #16 (8-bit)], #0 # long
050 movq [RAX + #24 (8-bit)], #0 # long
058 movq [RAX + #32 (8-bit)], #0 # long
058
060 B3: # B12 B4 <- B11 B2 Freq: 20168
060
060 movq RBP, RAX # spill
063 # checkcastPP of RBP
063 # TLS is in R15
063 movq R11, [R15 + #120 (8-bit)] # ptr
067 movq R10, R11 # spill
06a addq R10, #104 # ptr
06e # TLS is in R15
06e cmpq R10, [R15 + #136 (32-bit)] # raw ptr
075 jge,u B12 P=0.000100 C=-1.000000
075
07b B4: # B5 <- B3 Freq: 20166
07b # TLS is in R15
07b movq [R15 + #120 (8-bit)], R10 # ptr
07f PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
087 movq [R11], 0x0000000000000001 # ptr
08e PREFETCHNTA [R10 + #320 (32-bit)] # Prefetch to non-temporal cache for write
096 movq RDI, R11 # spill
099 addq RDI, #24 # ptr
09d PREFETCHNTA [R10 + #384 (32-bit)] # Prefetch to non-temporal cache for write
0a5 movq R10, precise klass [Ljava/lang/Object;: 0x00002aaaf264e928:Constant:exact * # ptr
0af movq [R11 + #8 (8-bit)], R10 # ptr
0b3 movl [R11 + #16 (8-bit)], #10 # int
0bb movl RCX, #10 # long (unsigned 32-bit)
0c0 xorl rax, rax # ClearArray:
rep stosq # Store rax to *rdi++ while rcx--
0c5
0c5 B5: # B16 B6 <- B13 B4 Freq: 20168
0c5
0c5 # checkcastPP of R11
0c5 movq [RBP + #32 (8-bit)], R11 # ptr ! Field java/util/Vector.elementData
0c9 movq R10, RBP # ptr -> long
0cc shrq R10, #9
0d0 movq RDX, java/lang/String:exact * # ptr
0da movq R11, 0x00002a959c9da580 # ptr
0e4 movb [R11 + R10], #0 # byte
0e9 movq RSI, RBP # spill
0ec nop # 3 bytes pad for loops and calls
0ef call,static java.util.Vector::add
# Main::foo @ bci:11 L[0]=RBP
# AllocatedObj(0x0000000040b30680)
0f4
0f4 B6: # B15 B7 <- B5 Freq: 20167.6
# Block is sole successor of call
0f4 movq RDX, java/lang/String:exact * # ptr
0fe movq RSI, RBP # spill
101 nop # 2 bytes pad for loops and calls
103 call,static java.util.Vector::add
# Main::foo @ bci:18 L[0]=RBP
# AllocatedObj(0x0000000040b30680)
108
108 B7: # B14 B8 <- B6 Freq: 20167.2
# Block is sole successor of call
108 movq RDX, java/lang/String:exact * # ptr
112 movq RSI, RBP # spill
115 nop # 2 bytes pad for loops and calls
117 call,static java.util.Vector::add
# Main::foo @ bci:25 L[0]=_
#
11c
11c B8: # N1 <- B7 Freq: 20166.8
# Block is sole successor of call
11c addq rsp, 80 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
127 ret
(slow path omitted)
The allocation of a Vector object (00c-058) is almost identical to the array allocation code we've seen before (except the additional field initializations at 048-058.) The array allocation for Vector.elementData follows (060-0C0.)
Note that the Vector constructors are defined in highly nested fashion like this:
public Vector(int initialCapacity, int capacityIncrement) {
    super();
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal Capacity: "+
                                           initialCapacity);
    this.elementData = new Object[initialCapacity];
    this.capacityIncrement = capacityIncrement;
}
public Vector(int initialCapacity) {
    this(initialCapacity, 0);
}
public Vector() {
    this(10);
}
... but the whole thing is inlined, so the end result is just as fast as the following code. This is great.
public Vector() {
    this.elementData = new Object[10];
    this.capacityIncrement = 0;
}
But wait, after that, you see that there are three call instructions for Vector.add. So there's neither lock elision nor lock coarsening, despite the fact that this Vector object never escapes the stack.
I thought perhaps that's because Vector.add is too complex to be inlined, so I tried the following code, in the hope of seeing the lock elision:
private static void foo() {
    Foo foo = new Foo();
    foo.inc();
    foo.inc();
    foo.inc();
}
private static final class Foo {
    int i=0;
    public synchronized void inc() {
        i++;
    }
}
This produced the following code:
000 B1: # B6 B2 <- BLOCK HEAD IS JUNK Freq: 19972
000 # stack bang
pushq rbp
subq rsp, #80 # Create frame
00c # TLS is in R15
00c movq RBP, [R15 + #120 (8-bit)] # ptr
010 movq R10, RBP # spill
013 addq R10, #24 # ptr
017 cmpq R10, [R15 + #136 (32-bit)] # raw ptr
01e jge,u B6 P=0.000100 C=-1.000000
01e
024 B2: # B3 <- B1 Freq: 19970
024 movq [R15 + #120 (8-bit)], R10 # ptr
028 PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
030 movq R10, precise klass Main$Foo: 0x00002aaaf2646e58:Constant:exact * # ptr
03a movq R11, [R10 + #176 (32-bit)] # ptr
041 movq [RBP], R11 # ptr
045 movq [RBP + #8 (8-bit)], R10 # ptr
049 movq [RBP + #16 (8-bit)], #0 # long
049
051 B3: # B8 B4 <- B7 B2 Freq: 19972
051
051 # checkcastPP of RBP
051 leaq R11, [rsp + #64] # box lock
056 fastlock RBP,R11,RAX,R10
135 jne B8 P=0.000001 C=-1.000000
135
13b B4: # B9 B5 <- B8 B3 Freq: 19972
13b MEMBAR-acquire (prior CMPXCHG in FastLock so empty encoding)
13b movl R11, [RBP + #16 (8-bit)] # int ! Field Main$Foo.i
13f incl R11 # int
142 movl [RBP + #16 (8-bit)], R11 # int ! Field Main$Foo.i
146 MEMBAR-release
146 MEMBAR-acquire
146 movl R11, [RBP + #16 (8-bit)] # int ! Field Main$Foo.i
14a incl R11 # int
14d movl [RBP + #16 (8-bit)], R11 # int ! Field Main$Foo.i
151 MEMBAR-release
151 MEMBAR-acquire
151 movl R11, [RBP + #16 (8-bit)] # int ! Field Main$Foo.i
155 incl R11 # int
158 movl [RBP + #16 (8-bit)], R11 # int ! Field Main$Foo.i
15c MEMBAR-release (a FastUnlock follows so empty encoding)
15c leaq RAX, [rsp + #64] # box lock
161 fastunlock RBP, RAX, R10
218 jne,s B9 P=0.000001 C=-1.000000
218
21a B5: # N1 <- B9 B4 Freq: 19972
21a addq rsp, 80 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
225 ret
(slow path omitted)
We are all familiar with the memory allocation by now, so we can skip that.
The 'fastlock' pseudo-instruction (AFAIK there's no such operation in amd64, and a single machine instruction can't possibly occupy 223 bytes!) must be the lock code. Here you see that lock coarsening has indeed happened (yay!), and the three increments happen in a single block between one fastlock and one fastunlock. MEMBAR-acquire/release must be pseudo-instructions too, which became no-ops in this scenario: note that the length of those instructions is 0.
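In source terms, the effect is roughly as if the three synchronized calls had been rewritten into one synchronized block. The sketch below shows the two forms side by side; it is only an illustration of the idea, the class and method names are made up, and the second method is a hand-written equivalent rather than anything the JIT actually produces as Java.

```java
// Source-level picture of lock coarsening: three lock/unlock pairs become one.
class CoarseningSketch {
    static final class Counter {
        int i = 0;

        synchronized void inc() {
            i++;
        }
    }

    /** As written in the source: each call acquires and releases the monitor. */
    static int asWritten() {
        Counter c = new Counter();
        c.inc();
        c.inc();
        c.inc();
        return c.i;
    }

    /** Roughly what the compiled code does: one acquire, three increments, one release. */
    static int asCoarsened() {
        Counter c = new Counter();
        synchronized (c) {
            c.i++;
            c.i++;
            c.i++;
        }
        return c.i;
    }

    public static void main(String[] args) {
        System.out.println(asWritten() + " " + asCoarsened());
    }
}
```

Both methods return 3. The difference is only in how many times the monitor is taken.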
Note that the JVM still fails to eliminate the lock here, despite the fact that this object doesn't escape the stack. I tried various things to see escape analysis and lock elision kick in, but couldn't find a way to do it. It looks like this feature is not quite in the JDK yet, although it's equally possible that I'm doing something stupid.
Also note that, presumably because of the memory barriers associated with this, each increment writes back to memory. This is unfortunate because in theory the three increments could have been combined into one, given that the lock was coarsened.
Indeed if I remove the 'synchronized' keyword, I get the following substantially simpler version:
000 B1: # B4 B2 <- BLOCK HEAD IS JUNK Freq: 27066
000 # stack bang
pushq rbp
subq rsp, #16 # Create frame
00c # TLS is in R15
00c movq RAX, [R15 + #120 (8-bit)] # ptr
010 movq R10, RAX # spill
013 addq R10, #24 # ptr
017 cmpq R10, [R15 + #136 (32-bit)] # raw ptr
01e jge,us B4 P=0.000100 C=-1.000000
01e
020 B2: # B3 <- B1 Freq: 27063.3
020 movq [R15 + #120 (8-bit)], R10 # ptr
024 PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
02c movq R10, precise klass Main$Foo: 0x00002aaaf25dfbc8:Constant:exact * # ptr
036 movq R11, [R10 + #176 (32-bit)] # ptr
03d movq [RAX], R11 # ptr
040 movq [RAX + #8 (8-bit)], R10 # ptr
040
044 B3: # N1 <- B5 B2 Freq: 27066
044 movq [RAX + #16 (8-bit)], #3 # long
04c
04c addq rsp, 16 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
057 ret
(slow path omitted)
So not only the three inc()s but also the initializer got collapsed into a single "movq [RAX + #16], #3" instruction. Wow!
All in all, modern JVMs seem pretty good at generating optimal code. In various situations, the resulting assembly code is far from the straight-forward instruction-by-instruction translation. OTOH, the escape analysis doesn't really seem to do anything useful yet.
This was a long post, but I hope you enjoyed this as much as I did.
Comments
-
cool++
Posted by: anjanb2 on March 30, 2008 at 11:08 PM
-
Hi,
Nice post. Regarding escape analysis, I _think_ you have to enable it by adding:
-XX:+DoEscapeAnalysis
Regards,
Ismael
Posted by: ijuma on March 31, 2008 at 01:52 AM
-
ijuma — Yes, I tried "-XX:+DoEscapeAnalysis" and "-XX:+EliminateLocks" but I didn't notice any difference in the compiled code.
Posted by: kohsuke on March 31, 2008 at 08:26 AM
-
Interesting, I vaguely remember seeing some benchmarks showing that StringBuffer had no overhead in comparison to StringBuilder when that option was enabled. I didn't try it myself though.
Posted by: ijuma on March 31, 2008 at 08:39 AM
-
Unfortunately, the crucial option (-XX:+PrintOptoAssembly) is available only to Sun guys. This option requires a disassembler library which is not available outside Sun (even though it is based on GNU sources). For more than a year I have been regularly checking this issue and there is still no visible progress.
Posted by: mstanik on March 31, 2008 at 10:41 PM
-
mstanik — sorry, that's just not true. You probably mean -XX:+PrintAssembly. As I wrote in this blog, all it takes for you to do this is a debug build of JDK.
Posted by: kohsuke on April 01, 2008 at 07:09 AM
-
I'm also not a HotSpot expert, but it seems to me that lock coarsening is working for your foo() method. Your assembly listing shows that the lock/unlock operations happen only once:
051 leaq R11, [rsp + #64] # box lock
056 fastlock RBP,R11,RAX,R10
135 jne B8 P=0.000001 C=-1.000000
... three copies of the foo.inc() code...
15c leaq RAX, [rsp + #64] # box lock
161 fastunlock RBP, RAX, R10
218 jne,s B9 P=0.000001 C=-1.000000
(There's also the out-of-line, slow locking code for inflated locks, at labels B8 and B9, that you don't show.) HotSpot is just not coarsening the memory barrier operations that are necessary to preserve JMM semantics. Apparently HotSpot's lock optimizations are not sufficiently smart to risk messing with happens-before constraints. And because the barriers are not elided, HotSpot can't also perform additional opts like constant propagation etc.
Posted by: opinali on April 01, 2008 at 12:05 PM
-
opinali — yes, but that's with Foo.inc() method. Where I was complaining, I was complaining about the lack of lock coarsening in Vector.add.
So I guess the best we can say is that the lock coarsening does work in some situations.
Posted by: kohsuke on April 01, 2008 at 12:11 PM
-
kohsuke, sorry the bad post - I must have missed/typo'ed the ending PRE tag, didn't use the preview... can you fix that?
Posted by: opinali on April 01, 2008 at 12:13 PM
-
Done.
Posted by: kohsuke on April 01, 2008 at 01:27 PM
-
Yep, I focused on the inc() example as you claimed that " JVM still fails to eliminate a lock here, despite the fact that this object doesn't escape the stack." You should be correct that the Vector.add() code is even worse because it's not inlined... this is an effect of callee-side locking. Suggestion: HotSpot could compile synchronized methods with a stub mechanism, i.e. a stub that does only the synchronization and calls the main code, so this main code could be invoked directly from callers that can optimize out the locking but cannot inline. This would have minimal cost for invocations into the synchronized stub (an extra CALL and RET), but would help optimize other interesting scenarios, like recursive calls and reentrant synchronization (when a given class has two synchronized methods a and b, and a invokes b, again without inlining).
Posted by: opinali on April 01, 2008 at 06:12 PM
-
Just a small suggestion if you will: it'd be interesting to see the assembly code produced for SPARC if you have access to such a machine and the Sun JVM on SPARC supports the same options. A lot of universities teach both Java and SPARC or MIPS assembly programming. SPARC assembly is also, IMHO, easier to read than x86. SPARC output would make this article a useful resource for many teachers, professors, and students.
Posted by: art_ on April 04, 2008 at 11:45 AM
-
opinali — I did say the lock coarsening happened with Foo.inc, but it's still doing a lock once, which it shouldn't have done, if the escape analysis was working as I expected.
The double-call approach you mention is interesting; maybe you should have a JVM :-)
Posted by: kohsuke on April 04, 2008 at 01:56 PM
-
art_ — thank you for the suggestion. I do have SPARC systems, but I think I'll leave the exercise to someone else.
Posted by: kohsuke on April 04, 2008 at 01:58 PM
-
mstanik — we're getting there; see http://wikis.sun.com/display/HotSpotInternals/PrintAssembly .
Posted by: jrose on April 05, 2008 at 08:29 PM
-
All this optimization stuff is quite nice and shows that Sun engineers have packed a great deal of intelligence into the HotSpot compiler. But a question remains: couldn't we get still better optimizations if we used an ahead-of-time compiler? Or if HotSpot were able to save and re-use the compilation results between runs?
Posted by: genepi on April 07, 2008 at 02:03 PM
-
genepi — Isn't Java already supposed to run faster than C on many benchmarks? So in that sense, isn't that argument already pretty much decided? I do however wonder if there can be some ahead of the time compilation to improve the start up performance, possibly with later recompilation as HotSpot acquires more information.
Posted by: kohsuke on April 07, 2008 at 03:57 PM
-
The non-temporal prefetches are indeed interesting. In the 256 array clearing example they are 64 bytes apart, which makes sense as that matches the width of a cache line. They are non-temporal, which is interesting as it means that when that line is evicted from the L1 data cache it will be sent directly to memory and not to L2. This means the compiler believes that that data will be written to in the near'ish future but not read from. The choice of memory locations is also puzzling as they would appear to be at the end of the array being allocated and beyond, unless I've got my math wrong?
Posted by: asymtote on April 07, 2008 at 04:25 PM
-
Awesome post!
As for the locking optimizations, Jeroen Borgers has written a very nice article asking the question, do these lock optimizations really work. It will be published on InfoQ as soon as I finish editing it 8^)
Posted by: kcpeppe on April 07, 2008 at 11:27 PM
-
Sorry I'm late to the party.
I'm the architect of the -server JIT.
If you have any lingering questions, please re-post them and I'll try to answer them.
As for the prefetch of 256+n*64 - that is so the new allocations hit in cache. Allocation memory is typically "very old" - not touched for a long time (probably since the last GC cycle) and hence isn't in any cache layer. The prefetch during *this* allocation means that the *next* allocation runs faster.
Also, I can print Azul assembly for those who are interested.
Cliff
Posted by: cliffc on May 01, 2008 at 08:20 AM
-
cliffc — thanks for the clarification. That's kind of what I suspected, but good to get a confirmation. I'm sure Azul assembly post (or in fact any insights into any JVM) would fascinate many folks, especially if it's coming from you.
Posted by: kohsuke on May 01, 2008 at 09:28 AM
-
Great post! But how did you get it to work? I downloaded "1.6.0_10-beta-fastdebug-b22" and I don't see the output you see. Was there a change from b14 to b22 with the PrintOptoAssembly option?
All I get is a hotspot.log file in an XML format, which doesn't contain any assembly code..
Posted by: aviadbd on May 11, 2008 at 05:57 AM