Deep dive into assembly code from Java
One of the things I learned at The Server Side Java Symposium 2008 was a command-line option that prints out the assembly code the JIT compiler produces. Since I've always been interested in seeing the final assembly code that gets produced from Java code, I decided to give it a test drive.
First the disclaimers:
- I'm not a performance expert.
- Don't try to take this too far, like optimizing your code against what you see here.
The option in question is only available in debug builds of JDKs. You can download one from here. The binary I tested is JDK6 u10 b14.
    $ java -fullversion
    java full version "1.6.0_10-beta-fastdebug-b14"
First, let's try something trivial:
    public class Main {
        public static void main(String[] args) {
            for(int i=0; i<100; i++)
                foo();
        }

        private static void foo() {
            for(int i=0; i<100; i++)
                bar();
        }

        private static void bar() {
        }
    }
I run this like "java -XX:+PrintOptoAssembly -server -cp . Main". The -XX:+PrintOptoAssembly is the magic option, and with this option I get the following, which shows the code of the "foo" method:
    000  B1: # N1 <- BLOCK HEAD IS JUNK Freq: 100
    000  pushq rbp
         subq rsp, #16 # Create frame
         nop # nop for patch_verified_entry
    006  addq rsp, 16 # Destroy frame
         popq rbp
         testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
    011  ret
You can see that the entire bar() call and the loop were optimized away. So it must have inlined the empty bar() method and then eliminated the loop, which no longer did anything.
Now to something more interesting:
    private static byte[] foo() {
        byte[] buf = new byte[256];
        for( int i=0; i<buf.length; i++ )
            buf[i] = 0;
        return buf;
    }
This produces the following code:
    000  B1: # B15 B2 <- BLOCK HEAD IS JUNK Freq: 78
    000  # stack bang
         pushq rbp
         subq rsp, #80 # Create frame
    00c  # TLS is in R15
    00c  movq R8, [R15 + #120 (8-bit)] # ptr
    010  movq R10, R8 # spill
    013  addq R10, #280 # ptr
    01a  cmpq R10, [R15 + #136 (32-bit)] # raw ptr
    021  jge,u B15 P=0.000100 C=-1.000000
    021
    027  B2: # B3 <- B1 Freq: 77.9922
    027  movq [R15 + #120 (8-bit)], R10 # ptr
    02b  PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
    033  movq [R8], 0x0000000000000001 # ptr
    03a  PREFETCHNTA [R10 + #320 (32-bit)] # Prefetch to non-temporal cache for write
    042  movq RDI, R8 # spill
    045  addq RDI, #24 # ptr
    049  PREFETCHNTA [R10 + #384 (32-bit)] # Prefetch to non-temporal cache for write
    051  movl RCX, #32 # long (unsigned 32-bit)
    056  movq R10, precise klass [B: 0x00002aaaab076708:Constant:exact * # ptr
    060  movq [R8 + #8 (8-bit)], R10 # ptr
    064  movl [R8 + #16 (8-bit)], #256 # int
    06c  xorl rax, rax # ClearArray:
         rep stosq # Store rax to *rdi++ while rcx--
    071
    071  B3: # B4 <- B16 B2 Freq: 78
    071
    071  # checkcastPP of R8
    071  xorl R10, R10 # int
    074  movl R9, #256 # int
         nop # 2 bytes pad for loops and calls
    07c  B4: # B17 B5 <- B3 B5 Loop: B4-B5 inner stride: not constant pre of N153 Freq: 19850.2
    07c  cmpl R10, #256 # unsigned
    083  jge,u B17 P=0.000001 C=-1.000000
    083
    089  B5: # B4 B6 <- B4 Freq: 19850.2
    089  movslq R11, R10 # i2l
    08c  movb [R8 + #24 + R11], #0 # byte
    092  incl R10 # int
    095  cmpl R10, #8
    099  jlt,s B4 P=0.996072 C=22313.000000
    099
    09b  B6: # B11 B7 <- B5 Freq: 77.9799
    09b  subl R9, R10 # int
    09e  andl R9, #-16 # int
    0a2  addl R9, R10 # int
    0a5  cmpl R10, R9
    0a8  jge,s B11 P=0.500000 C=-1.000000
    0a8
    0aa  B7: # B8 <- B6 Freq: 38.9899
    0aa  PXOR XMM0,XMM0 ! replicate8B
         nop # 2 bytes pad for loops and calls
    0b0  B8: # B10 B9 <- B7 B9 Loop: B8-B9 inner stride: not constant main of N85 Freq: 9925.09
    0b0  movslq R11, R10 # i2l
    0b3  MOVQ [R8 + #24 + R11],XMM0 ! packed8B
    0ba  movl R11, R10 # spill
    0bd  addl R11, #16 # int
    0c1  movslq R10, R10 # i2l
    0c4  MOVQ [R8 + #32 + R10],XMM0 ! packed8B
    0cb  cmpl R11, R9
    0ce  jge,s B10 P=0.003928 C=22313.000000
    0ce
    0d0  B9: # B8 <- B8 Freq: 9886.1
    0d0  movl R10, R11 # spill
    0d3  jmp,s B8
    0d3
    0d5  B10: # B11 <- B8 Freq: 38.9899
    0d5  movl R10, R11 # spill
    0d5
    0d8  B11: # B14 B12 <- B6 B10 Freq: 77.9799
    0d8  cmpl R10, #256
    0df  jge,s B14 P=0.500000 C=-1.000000
         nop # 3 bytes pad for loops and calls
    0e4  B12: # B17 B13 <- B11 B13 Loop: B12-B13 inner stride: not constant post of N153 Freq: 9922.54
    0e4  cmpl R10, #256 # unsigned
    0eb  jge,us B17 P=0.000001 C=-1.000000
    0eb
    0ed  B13: # B12 B14 <- B12 Freq: 9922.53
    0ed  movslq R11, R10 # i2l
    0f0  movb [R8 + #24 + R11], #0 # byte
    0f6  incl R10 # int
    0f9  cmpl R10, #256
    100  jlt,s B12 P=0.996072 C=22313.000000
    100
    102  B14: # N1 <- B13 B11 Freq: 77.9698
    102  movq RAX, R8 # spill
    105  addq rsp, 80 # Destroy frame
         popq rbp
         testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
    110  ret
    110
    111  B15: # B18 B16 <- B1 Freq: 0.00780129
    111  movq RSI, precise klass [B: 0x00002aaaab076708:Constant:exact * # ptr
    11b  movl RDX, #256 # int
    120  nop # 3 bytes pad for loops and calls
    123  call,static wrapper for: _new_array_Java
         # Main::foo @ bci:3 L[0]=_ L[1]=_
         #
    128
    128  B16: # B3 <- B15 Freq: 0.00780114
         # Block is sole successor of call
    128  movq R8, RAX # spill
    12b  jmp B3
    12b
    130  B17: # N1 <- B12 B4 Freq: 1e-06
    130  movl RSI, #-28 # int
    135  movq RBP, R8 # spill
    138  movl [rsp + #0], R10 # spill
    13c  nop # 3 bytes pad for loops and calls
    13f  call,static wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant')
         # Main::foo @ bci:17 L[0]=RBP L[1]=rsp + #0 STK[0]=RBP STK[1]=rsp + #0 STK[2]=#0
         # AllocatedObj(0x0000000040c31880)
    144  int3 # ShouldNotReachHere
    144
    151  B18: # N1 <- B15 Freq: 7.80129e-08
    151  # exception oop is in rax; no code emitted
    151  movq RSI, RAX # spill
    154  addq rsp, 80 # Destroy frame
         popq rbp
    159  jmp rethrow_stub
Just to recap, R8-R15 are the additional general-purpose 64-bit registers introduced with amd64.
The first part (00c-027) allocates the array, and this is already interesting. As the comment indicates, R15 is apparently used as a pointer to the thread-local storage of the current thread, and R15[120] is the pointer to the head of the heap sub-space dedicated to this thread.
So the byte[] is allocated from this thread-local space by simply reserving 280 bytes: 256 for the elements plus 24 for the header (see "addq R10, #280"). If there's not enough space (the limit is kept at R15[136]), it falls back to the slower allocation code at B15, which presumably reserves a new chunk from the eden space and allocates the new object there.
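To make that fast path concrete, here is a minimal Java model of the bump-the-pointer allocation. It is only a sketch: the class and field names (ThreadLocalBuffer, top, end) are made up for illustration and are not HotSpot's, and only the add/compare/store sequence is taken from the listing.

```java
// Hypothetical model of the allocation fast path at 00c-027.
// "top" plays the role of [R15 + #120] and "end" the role of [R15 + #136].
final class ThreadLocalBuffer {
    private long top;        // next free address in this thread's chunk
    private final long end;  // limit of this thread's chunk

    ThreadLocalBuffer(long start, long size) {
        this.top = start;
        this.end = start + size;
    }

    /** Returns the address of the new object, or -1 when the slow path would be taken. */
    long allocate(long bytes) {
        long newTop = top + bytes;  // addq R10, #280
        if (newTop >= end) {        // cmpq R10, [R15 + #136] / jge,u B15
            return -1;              // stands in for the call into the runtime at B15
        }
        long result = top;
        top = newTop;               // movq [R15 + #120], R10
        return result;
    }

    public static void main(String[] args) {
        ThreadLocalBuffer buffer = new ThreadLocalBuffer(4096, 1024);
        for (int n = 0; n < 4; n++) {
            // Three 280-byte requests fit into 1024 bytes; the fourth does not.
            System.out.println(buffer.allocate(280));
        }
    }
}
```

The point of the design is that the common case needs no lock and no call at all: since the chunk belongs to one thread, an add, a compare, and a store are enough.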
Once the pointer to the new array is set to R8 at 00c, the initialization follows (033-071). The first 24 bytes of the newly allocated space are used for metadata (the first 8 bytes are probably lock or GC-related, followed by a pointer to the class object, then another 8 bytes for the length of the array). 06c zero-clears the array. In theory the zero-clear shouldn't have been necessary, as we then fill the array with zeros again, but the JIT failed to take advantage of that.
But note that the zero-clear is done 8 bytes at a time (rep stosq with RCX set to 32), so it did recognize that the array size is a multiple of 8.
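As a sanity check on the constants in the listing, the small sketch below recomputes them. The 24-byte header and the round-up to a multiple of 8 are assumptions read off this particular 64-bit listing, and the class name ArrayLayout is hypothetical.

```java
// Recomputes the sizes seen in the listing under the assumptions stated above.
final class ArrayLayout {
    // Assumption: element 0 sits 24 bytes into the object, as "addq RDI, #24" suggests.
    static final int HEADER_BYTES = 24;

    /** Bytes reserved for an array, rounded up to a multiple of 8 (assumed alignment). */
    static int allocationSize(int length, int elementBytes) {
        int raw = HEADER_BYTES + length * elementBytes;
        return (raw + 7) & ~7;
    }

    /** Number of 8-byte stores that "rep stosq" needs to clear the element area. */
    static int clearQuadwords(int length, int elementBytes) {
        return (allocationSize(length, elementBytes) - HEADER_BYTES) / 8;
    }

    public static void main(String[] args) {
        System.out.println(allocationSize(256, 1)); // prints 280, compare "addq R10, #280"
        System.out.println(clearQuadwords(256, 1)); // prints 32, compare "movl RCX, #32"
    }
}
```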
I don't quite understand what those prefetch instructions (at 02b, 03a, and 049) are meant for. Presumably they are to make sure that the next time an object allocation happens, that part of the memory is in cache, but why 256, 320, and 384? Does anyone have a clue?
Now as of 074, R8 is the pointer to 'buf' and R9 is the length of the array. Note that JIT knows that buf.length is always 256 here, so this is movl R9,256 and not movl R9,[R8+16]. Also note that this computation is outside the for loop. So this tells us that there's no need to explicitly assign the array length to a temporary variable in a tight loop, because JIT does the equivalent anyway:
    int len = buf.length;
    for( int i=0; i<len; i++ )
        buf[i] = 0;
Similarly there's no need to reverse the direction of the loop to avoid buf.length computation.
The way the loop is compiled is very interesting. First there's the 'warm up' part (07c-099) that fills the array byte by byte until the index reaches 8, then the 'fast loop' portion (09b-0d3) that zero-fills 16 bytes per iteration with two 8-byte stores from an SSE register (XMM0), then the final 'cool down' part (0d5-100) that handles the remaining bytes that don't make up a whole 16-byte step. In this case, in theory it could have figured out that 256 bytes divide evenly into such steps, so the warm up and cool down were unnecessary, but it appears that the JIT didn't realize this.
I don't know what kind of computation happens behind the scenes here, but overall this loop unrolling is rather clever. The original code was a byte-by-byte assignment of 0, but in the final code, one iteration of the fast loop clears 16 bytes.
I also noticed that there's no array boundary check in the fast loop portion, which is nice.
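For readers who prefer source code to assembly, the three-part loop can be sketched in Java roughly as follows. This is my own reading of the listing, not compiler output: the class and method names are hypothetical, and ByteBuffer.putLong stands in for the 8-byte MOVQ stores.

```java
import java.nio.ByteBuffer;

// Hand-written approximation of the pre/main/post loop structure in the listing.
final class SplitLoopFill {
    /** Limit of the main loop, computed like subl / andl #-16 / addl at 09b-0a2. */
    static int mainLimit(int start, int length) {
        return start + ((length - start) & -16);
    }

    static void fill(byte[] buf) {
        ByteBuffer view = ByteBuffer.wrap(buf);
        int len = buf.length;
        int i = 0;
        // 'Warm up' (07c-099): one byte at a time until the index reaches 8.
        while (i < len && i < 8) {
            buf[i] = 0;
            i++;
        }
        // 'Fast loop' (0b0-0d3): two 8-byte stores, 16 bytes per iteration.
        int limit = mainLimit(i, len);
        while (i < limit) {
            view.putLong(i, 0L);
            view.putLong(i + 8, 0L);
            i += 16;
        }
        // 'Cool down' (0e4-100): whatever is left, one byte at a time.
        while (i < len) {
            buf[i] = 0;
            i++;
        }
    }

    public static void main(String[] args) {
        // For a 256-byte array the fast loop covers indexes 8 to 247.
        System.out.println(mainLimit(8, 256)); // prints 248
    }
}
```

With a 256-byte array the warm up handles 8 bytes, the fast loop 240, and the cool down the last 8, which matches the observation that both byte-wise loops still run.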
OK, most of you have hopefully heard that in JDK6 they do lock coarsening and lock elision. So let's see that in action.
For that, I compiled the following code and executed in the same fashion:
    private static void foo() {
        Vector v = new Vector();
        v.add("abc");
        v.add("def");
        v.add("ghi");
    }
This gives me the following:
    000  B1: # B10 B2 <- BLOCK HEAD IS JUNK Freq: 20168
    000  # stack bang
         pushq rbp
         subq rsp, #80 # Create frame
    00c  # TLS is in R15
    00c  movq RAX, [R15 + #120 (8-bit)] # ptr
    010  movq R10, RAX # spill
    013  addq R10, #40 # ptr
    017  # TLS is in R15
    017  cmpq R10, [R15 + #136 (32-bit)] # raw ptr
    01e  jge,u B10 P=0.000100 C=-1.000000
    01e
    024  B2: # B3 <- B1 Freq: 20166
    024  # TLS is in R15
    024  movq [R15 + #120 (8-bit)], R10 # ptr
    028  PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
    030  movq R10, precise klass java/util/Vector: 0x00002aaaf2649f38:Constant:exact * # ptr
    03a  movq R11, [R10 + #176 (32-bit)] # ptr
    041  movq [RAX], R11 # ptr
    044  movq [RAX + #8 (8-bit)], R10 # ptr
    048  movq [RAX + #16 (8-bit)], #0 # long
    050  movq [RAX + #24 (8-bit)], #0 # long
    058  movq [RAX + #32 (8-bit)], #0 # long
    058
    060  B3: # B12 B4 <- B11 B2 Freq: 20168
    060
    060  movq RBP, RAX # spill
    063  # checkcastPP of RBP
    063  # TLS is in R15
    063  movq R11, [R15 + #120 (8-bit)] # ptr
    067  movq R10, R11 # spill
    06a  addq R10, #104 # ptr
    06e  # TLS is in R15
    06e  cmpq R10, [R15 + #136 (32-bit)] # raw ptr
    075  jge,u B12 P=0.000100 C=-1.000000
    075
    07b  B4: # B5 <- B3 Freq: 20166
    07b  # TLS is in R15
    07b  movq [R15 + #120 (8-bit)], R10 # ptr
    07f  PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
    087  movq [R11], 0x0000000000000001 # ptr
    08e  PREFETCHNTA [R10 + #320 (32-bit)] # Prefetch to non-temporal cache for write
    096  movq RDI, R11 # spill
    099  addq RDI, #24 # ptr
    09d  PREFETCHNTA [R10 + #384 (32-bit)] # Prefetch to non-temporal cache for write
    0a5  movq R10, precise klass [Ljava/lang/Object;: 0x00002aaaf264e928:Constant:exact * # ptr
    0af  movq [R11 + #8 (8-bit)], R10 # ptr
    0b3  movl [R11 + #16 (8-bit)], #10 # int
    0bb  movl RCX, #10 # long (unsigned 32-bit)
    0c0  xorl rax, rax # ClearArray:
         rep stosq # Store rax to *rdi++ while rcx--
    0c5
    0c5  B5: # B16 B6 <- B13 B4 Freq: 20168
    0c5
    0c5  # checkcastPP of R11
    0c5  movq [RBP + #32 (8-bit)], R11 # ptr ! Field java/util/Vector.elementData
    0c9  movq R10, RBP # ptr -> long
    0cc  shrq R10, #9
    0d0  movq RDX, java/lang/String:exact * # ptr
    0da  movq R11, 0x00002a959c9da580 # ptr
    0e4  movb [R11 + R10], #0 # byte
    0e9  movq RSI, RBP # spill
    0ec  nop # 3 bytes pad for loops and calls
    0ef  call,static java.util.Vector::add
         # Main::foo @ bci:11 L[0]=RBP
         # AllocatedObj(0x0000000040b30680)
    0f4
    0f4  B6: # B15 B7 <- B5 Freq: 20167.6
         # Block is sole successor of call
    0f4  movq RDX, java/lang/String:exact * # ptr
    0fe  movq RSI, RBP # spill
    101  nop # 2 bytes pad for loops and calls
    103  call,static java.util.Vector::add
         # Main::foo @ bci:18 L[0]=RBP
         # AllocatedObj(0x0000000040b30680)
    108
    108  B7: # B14 B8 <- B6 Freq: 20167.2
         # Block is sole successor of call
    108  movq RDX, java/lang/String:exact * # ptr
    112  movq RSI, RBP # spill
    115  nop # 2 bytes pad for loops and calls
    117  call,static java.util.Vector::add
         # Main::foo @ bci:25 L[0]=_
         #
    11c
    11c  B8: # N1 <- B7 Freq: 20166.8
         # Block is sole successor of call
    11c  addq rsp, 80 # Destroy frame
         popq rbp
         testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
    127  ret
    (slow path omitted)
The allocation of a Vector object (00c-058) is almost identical to the array allocation code we've seen before (except the additional field initializations at 048-058.) The array allocation for Vector.elementData follows (060-0C0.)
Note that the Vector constructors are defined in highly nested fashion like this:
    public Vector(int initialCapacity, int capacityIncrement) {
        super();
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal Capacity: "+ initialCapacity);
        this.elementData = new Object[initialCapacity];
        this.capacityIncrement = capacityIncrement;
    }

    public Vector(int initialCapacity) {
        this(initialCapacity, 0);
    }

    public Vector() {
        this(10);
    }
... but the whole thing is inlined, so the end result is just as fast as the following code. This is great.
    public Vector() {
        this.elementData = new Object[10];
        this.capacityIncrement = 0;
    }
But wait, after that, you see that there are three call instructions for Vector.add. So there's neither lock elision nor lock coarsening, despite the fact that this Vector object never escapes the stack.
I thought perhaps that's because Vector.add is too complex to be inlined, so I tried the following code, in the hope of seeing the lock elision:
    private static void foo() {
        Foo foo = new Foo();
        foo.inc();
        foo.inc();
        foo.inc();
    }

    private static final class Foo {
        int i=0;
        public synchronized void inc() { i++; }
    }
This produced the following code:
    000  B1: # B6 B2 <- BLOCK HEAD IS JUNK Freq: 19972
    000  # stack bang
         pushq rbp
         subq rsp, #80 # Create frame
    00c  # TLS is in R15
    00c  movq RBP, [R15 + #120 (8-bit)] # ptr
    010  movq R10, RBP # spill
    013  addq R10, #24 # ptr
    017  cmpq R10, [R15 + #136 (32-bit)] # raw ptr
    01e  jge,u B6 P=0.000100 C=-1.000000
    01e
    024  B2: # B3 <- B1 Freq: 19970
    024  movq [R15 + #120 (8-bit)], R10 # ptr
    028  PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
    030  movq R10, precise klass Main$Foo: 0x00002aaaf2646e58:Constant:exact * # ptr
    03a  movq R11, [R10 + #176 (32-bit)] # ptr
    041  movq [RBP], R11 # ptr
    045  movq [RBP + #8 (8-bit)], R10 # ptr
    049  movq [RBP + #16 (8-bit)], #0 # long
    049
    051  B3: # B8 B4 <- B7 B2 Freq: 19972
    051
    051  # checkcastPP of RBP
    051  leaq R11, [rsp + #64] # box lock
    056  fastlock RBP,R11,RAX,R10
    135  jne B8 P=0.000001 C=-1.000000
    135
    13b  B4: # B9 B5 <- B8 B3 Freq: 19972
    13b  MEMBAR-acquire (prior CMPXCHG in FastLock so empty encoding)
    13b  movl R11, [RBP + #16 (8-bit)] # int ! Field Main$Foo.i
    13f  incl R11 # int
    142  movl [RBP + #16 (8-bit)], R11 # int ! Field Main$Foo.i
    146  MEMBAR-release
    146  MEMBAR-acquire
    146  movl R11, [RBP + #16 (8-bit)] # int ! Field Main$Foo.i
    14a  incl R11 # int
    14d  movl [RBP + #16 (8-bit)], R11 # int ! Field Main$Foo.i
    151  MEMBAR-release
    151  MEMBAR-acquire
    151  movl R11, [RBP + #16 (8-bit)] # int ! Field Main$Foo.i
    155  incl R11 # int
    158  movl [RBP + #16 (8-bit)], R11 # int ! Field Main$Foo.i
    15c  MEMBAR-release (a FastUnlock follows so empty encoding)
    15c  leaq RAX, [rsp + #64] # box lock
    161  fastunlock RBP, RAX, R10
    218  jne,s B9 P=0.000001 C=-1.000000
    218
    21a  B5: # N1 <- B9 B4 Freq: 19972
    21a  addq rsp, 80 # Destroy frame
         popq rbp
         testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
    225  ret
    (slow path omitted)
We are all familiar with the memory allocation by now, so we can skip that.
The 'fastlock' pseudo-instruction (AFAIK there's no such operation in amd64, and a single machine instruction can't possibly occupy 223 bytes!) must be the lock code. Here you see that the lock coarsening has indeed happened (yay!), and the three increments happen in a single block. MEMBAR-acquire/release must be another pseudo-instruction, which became a no-op in this scenario: note that the length of those instructions is 0.
Note that the JVM still fails to eliminate the lock here, despite the fact that this object doesn't escape the stack. I tried various things to see the effect of escape analysis and lock elision kick in, but couldn't find a way to do it. It looks like this feature is not quite in the JDK yet, although it's equally possible that I'm doing something stupid.
Also note that, presumably because of the memory barriers, each increment is written back to memory. This is unfortunate because in theory the three increments could have been combined into one, given that the lock was coarsened.
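Expressed back in Java, the difference between what was written, what the listing does, and what one might hope for looks roughly like this. It is a hand-written illustration, not something the JVM produces; the class and method names are hypothetical, and the methods return the counter (unlike the original foo()) only so the result can be observed.

```java
// Source-level picture of lock coarsening on the Foo example.
final class CoarseningSketch {
    static final class Foo {
        int i = 0;
        public synchronized void inc() { i++; }
    }

    /** As written: three synchronized calls, so three lock/unlock pairs. */
    static int asWritten() {
        Foo foo = new Foo();
        foo.inc();
        foo.inc();
        foo.inc();
        return foo.i;
    }

    /** Roughly what the listing does: one lock around three separate load/increment/store steps. */
    static int coarsened() {
        Foo foo = new Foo();
        synchronized (foo) {
            foo.i++;
            foo.i++;
            foo.i++;
        }
        return foo.i;
    }

    /** What could be done if the lock were elided and the increments combined. */
    static int combined() {
        Foo foo = new Foo();
        foo.i = 3;
        return foo.i;
    }

    public static void main(String[] args) {
        // All three variants compute the same value; only the locking differs.
        System.out.println(asWritten() + " " + coarsened() + " " + combined()); // prints 3 3 3
    }
}
```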
Indeed if I remove the 'synchronized' keyword, I get the following substantially simpler version:
    000  B1: # B4 B2 <- BLOCK HEAD IS JUNK Freq: 27066
    000  # stack bang
         pushq rbp
         subq rsp, #16 # Create frame
    00c  # TLS is in R15
    00c  movq RAX, [R15 + #120 (8-bit)] # ptr
    010  movq R10, RAX # spill
    013  addq R10, #24 # ptr
    017  cmpq R10, [R15 + #136 (32-bit)] # raw ptr
    01e  jge,us B4 P=0.000100 C=-1.000000
    01e
    020  B2: # B3 <- B1 Freq: 27063.3
    020  movq [R15 + #120 (8-bit)], R10 # ptr
    024  PREFETCHNTA [R10 + #256 (32-bit)] # Prefetch to non-temporal cache for write
    02c  movq R10, precise klass Main$Foo: 0x00002aaaf25dfbc8:Constant:exact * # ptr
    036  movq R11, [R10 + #176 (32-bit)] # ptr
    03d  movq [RAX], R11 # ptr
    040  movq [RAX + #8 (8-bit)], R10 # ptr
    040
    044  B3: # N1 <- B5 B2 Freq: 27066
    044  movq [RAX + #16 (8-bit)], #3 # long
    04c
    04c  addq rsp, 16 # Destroy frame
         popq rbp
         testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
    057  ret
    (slow path omitted)
So not only the three inc() calls but also the initializer got collapsed into a single "movq rax[16],3" instruction. Wow!
All in all, modern JVMs seem pretty good at generating optimal code. In various situations, the resulting assembly code is far from the straight-forward instruction-by-instruction translation. OTOH, the escape analysis doesn't really seem to do anything useful yet.
This was a long post, but I hope you enjoyed this as much as I did.
Comments
by jarkko - 2008-12-03 18:34
Are there any studies on how a combined client-server compiler would perform? E.g. after the first 1000 executions/loops, compile the code with the client compiler, then after 9000 more executions/loops compile it with the server compiler.
by opinali - 2008-04-01 13:13
kohsuke, sorry for the bad post - I must have missed/typo'ed the ending PRE tag, didn't use the preview... can you fix that?
by kohsuke - 2008-04-01 13:11
opinali -- yes, but that's with Foo.inc() method. Where I was complaining, I was complaining about the lack of lock coarsening in Vector.add.
So I guess the best we can say is that the lock coarsening does work in some situations.
by opinali - 2008-04-01 13:05
I'm also not a HotSpot expert, but it seems to me that lock coarsening is working for your foo() method. Your assembly listing shows that the lock/unlock operations happen only once:
    051  leaq R11, [rsp + #64] # box lock
    056  fastlock RBP,R11,RAX,R10
    135  jne B8 P=0.000001 C=-1.000000
    ... three copies of the foo.inc() code...
    15c  leaq RAX, [rsp + #64] # box lock
    161  fastunlock RBP, RAX, R10
    218  jne,s B9 P=0.000001 C=-1.000000
(There's also the out-of-line, slow locking code for inflated locks, at labels B8 and B9, that you don't show.) HotSpot is just not coarsening the memory barrier operations that are necessary to preserve JMM semantics. Apparently HotSpot's lock optimizations are not sufficiently smart to risk messing with happens-before constraints. And because the barriers are not elided, HotSpot also can't perform additional opts like constant propagation etc.
by kohsuke - 2008-04-01 08:09
mstanik -- sorry, that's just not true. You probably mean -XX:+PrintAssembly. As I wrote in this blog, all it takes for you to do this is a debug build of JDK.
by opinali - 2008-04-01 19:12
Yep, I focused on the inc() example as you claimed that "JVM still fails to eliminate a lock here, despite the fact that this object doesn't escape the stack." You should be correct that the Vector.add() code is even worse because it's not inlined... this is an effect of callee-side locking. Suggestion: HotSpot could compile synchronized methods with a stub mechanism, i.e. a stub that does only the synchronization and calls the main code, so this main code could be invoked directly from callers that can optimize out the locking but cannot inline. This would have minimal cost for invocations into the synchronized stub (an extra CALL and RET), but would help optimize other interesting scenarios, like recursive calls and reentrant synchronization (when a given class has two synchronized methods a and b, and a invokes b, again without inlining).
by kohsuke - 2008-04-01 14:27
Done.
by cliffc - 2008-05-01 09:20
Sorry I'm late to the party. I'm the architect of the -server JIT.
If you have any lingering questions, please re-post them and I'll try to answer them.
As for the prefetch of 256+n*64 - that is so the new allocations hit in cache. Allocation memory is typically "very old" - not touched for a long time (probably since the last GC cycle) and hence isn't in any cache layer. The prefetch during *this* allocation means that the *next* allocation runs faster.
Also, I can print Azul assembly for those who are interested.
Cliff
by kohsuke - 2008-05-01 10:28
cliffc -- thanks for the clarification. That's kind of what I suspected, but good to get a confirmation. I'm sure Azul assembly post (or in fact any insights into any JVM) would fascinate many folks, especially if it's coming from you.
by jrose - 2008-04-05 21:29
mstanik -- we're getting there; see http://wikis.sun.com/display/HotSpotInternals/PrintAssembly .
by kohsuke - 2008-04-04 14:58
art_ -- thank you for the suggestion. I do have SPARC systems, but I think I'll leave the exercise to someone else.
by kohsuke - 2008-04-04 14:56
opinali -- I did say the lock coarsening happened with Foo.inc, but it's still doing a lock once, which it shouldn't have done, if the escape analysis was working as I expected.
The double-call approach you mention is interesting; maybe you should have a JVM :-)
by art_ - 2008-04-04 12:45
Just a small suggestion if you will: it'd be interesting to see the assembly code produced for SPARC if you have access to such a machine and the Sun JVM on SPARC supports the same options. A lot of universities teach both Java and SPARC or MIPS assembly programming. SPARC assembly is also, IMHO, easier to read than x86. SPARC output would make this article a useful resource for many teachers, professors, and students.
by cliffc - 2008-05-15 20:50
Also -XX:+PrintCompilation will tell you when something gets JIT'd. The default -client compiler does much less optimization, and compiles fairly quickly, but certainly not run-once code. The -server compiler only compiles things that have executed or looped 10000 or more times, but it compiles with a much higher level of optimization. Cliff
by kcpeppe - 2008-04-08 00:27
Awesome post! As for the locking optimizations, Jeroen Borgers has written a very nice article asking the question, do these lock optimizations really work. It will be published on InfoQ as soon as I finish editing it 8^)
by genepi - 2008-04-07 15:03
All this optimization stuff is quite nice and shows that Sun engineers have packed a great deal of intelligence into the HotSpot compiler. But a question remains: couldn't we get still better optimizations if we used an ahead-of-time compiler? Or if HotSpot was able to save and re-use the compilation results between runs?
by asymtote - 2008-04-07 17:25
The non-temporal prefetches are indeed interesting. In the 256 array clearing example they are 64 bytes apart, which makes sense as that matches the width of a cache line. They are non-temporal, which is interesting as it means that when that line is evicted from the L1 data cache it will be sent directly to memory and not to L2. This means the compiler believes that that data will be written to in the near'ish future but not read from. The choice of memory locations is also puzzling as they would appear to be at the end of the array being allocated and beyond, unless I've got my math wrong?
by kohsuke - 2008-04-07 16:57
genepi -- Isn't Java already supposed to run faster than C on many benchmarks? So in that sense, isn't that argument already pretty much decided? I do however wonder if there can be some ahead-of-time compilation to improve the start-up performance, possibly with later recompilation as HotSpot acquires more information.
by kohsuke - 2008-05-13 17:54
OK, so now there are two of you. I'll try the latest version then. I did this experiment at home, so I can swear that I downloaded it from a public website.
by etni3s - 2008-05-13 17:48
kohsuke: Are you able to reproduce the same features in new builds? I'm having the same problems as the users above. Running java 1.6.0_10-beta-fastdebug-b23 in a Windows 32bit / Cygwin environment. Got a bit discouraged after reading this: http://blogs.tedneward.com/2008/04/06/The+Complexities+Of+Black+Boxes.aspx
by kohsuke - 2008-05-12 10:53
All I needed to do was "java -XX:+PrintOptoAssembly -server -cp . Main" as explained in the post. Are you sure you are running the right version of JVM, as opposed to the one in your PATH?
by aviadbd - 2008-05-13 09:43
Yeah, I changed the JAVA_HOME, and then alias'ed "java" to "JAVA_HOME/bin/java" - and writing "java -version" gave me the correct version (I wrote it in the last comment). Think I'm missing something?
by aviadbd - 2008-05-11 06:57
Great post! But how did you get it to work? I downloaded "1.6.0_10-beta-fastdebug-b22" and I don't see the output you see. Was there a change from b14 to b22 with the PrintOptoAssembly option? All I get is a hotspot.log file in an XML format, which doesn't contain any assembly code.
by ijuma - 2008-03-31 09:39
Interesting, I vaguely remember seeing some benchmarks showing that StringBuffer had no overhead in comparison to StringBuilder when that option was enabled. I didn't try it myself though.
by kohsuke - 2008-03-31 09:26
ijuma -- Yes, I tried "-XX:+DoEscapeAnalysis" and "-XX:+EliminateLocks" but I didn't notice any difference in the compiled code.
by ijuma - 2008-03-31 02:52
Hi, Nice post. Regarding escape analysis, I _think_ you have to enable it by adding: -XX:+DoEscapeAnalysis Regards, Ismael
by anjanb2 - 2008-03-31 00:08
cool++
by mstanik - 2008-03-31 23:41
Unfortunately, the crucial option (-XX:+PrintOptoAssembly) is available only to Sun guys. This option requires a disassembler library which is not available outside Sun (despite the fact that it is based on GNU sources). For more than a year I have been regularly checking this issue and there is still no visible progress.
by etni3s - 2008-05-14 17:56
Seems I can't post 'less than' chars here, so to continue the above post: Changing i less than 100 to 100000 worked. Thanks for a nice blogpost.
by etni3s - 2008-05-14 17:54
Ah ok. Didn't know there were size/time limitations. When trying to run the first "trivial" example above nothing happened. But changing "i<100" to "i<100000" did the trick. Thanks for the nice article. :)
by kohsuke - 2008-05-14 09:19
I just tried "1.6.0_10-beta-fastdebug-b23" on Linux amd64 and it did work. Note that the assembly dump is produced only when the code is JIT-compiled, so you do have to run the program for some duration to see any output. So for example if you just do:
    public class Main {
        public static void main(String[] args) {
            System.out.println("Hello");
        }
    }
then you won't see anything. Try something like this instead:
    public class Main {
        public static void main(String[] args) {
            for(int i=0;i<100000;i++)
                bar();
        }

        public static void bar() {
            for(int i=0;i<100000;i++)
                foo();
        }

        public static void foo() {}
    }
Also, some people claimed that this option only works in the -server JVM, but that is not true. The only reason I used the -server switch is because I wanted to see what JVM is really capable of, and I've heard that there are several optimizations only available in the server JVM.