
Deep dive into assembly code from Java

Posted by kohsuke on March 30, 2008 at 10:10 PM PDT

One of the things I learned at The Server Side Java Symposium 2008 was a command-line option that prints out the assembly code the JIT is producing. I've always been interested in seeing the final assembly code that gets produced from Java code, so I decided to give it a test drive.

First the disclaimers:

  1. I'm not a performance expert.
  2. Don't try to take this too far, like optimizing your code against what you see here.

The option in question is only available in debug builds of JDKs. You can download one from here. The binary I tested is JDK6 u10 b14.

$ java -fullversion
java full version "1.6.0_10-beta-fastdebug-b14"

First, let's try something trivial:

>public class Main {
    public static void main(String[] args) {
        for(int i=0; i<100; i++)
            foo();
    }

    private static void foo() {
        for(int i=0; i<100; i++)
            bar();
    }

    private static void bar() {
    }
}

I run this like "java -XX:+PrintOptoAssembly -server -cp . Main". The -XX:+PrintOptoAssembly is the magic option, and with this option I get the following, which shows the code of the "foo" method:

>000   B1: #     N1 <- BLOCK HEAD IS JUNK   Freq: 100
000     pushq   rbp
        subq    rsp, #16        # Create frame
        nop     # nop for patch_verified_entry
006     addq    rsp, 16 # Destroy frame
        popq    rbp
        testl   rax, [rip + #offset_to_poll_page]       # Safepoint: poll for GC

011     ret

You can see that the entire bar() call and the loop were optimized away. So it must have inlined the bar() method and then eliminated the loop, which no longer had a body.

Now to something more interesting:

>    private static byte[] foo() {
        byte[] buf = new byte[256];
        for( int i=0; i<buf.length; i++ )
            buf[i] = 0;
        return buf;
    }

This produces the following code:

>000   B1: #     B15 B2 <- BLOCK HEAD IS JUNK   Freq: 78
000     # stack bang
        pushq   rbp
        subq    rsp, #80        # Create frame
00c     # TLS is in R15
00c     movq    R8, [R15 + #120 (8-bit)]        # ptr
010     movq    R10, R8 # spill
013     addq    R10, #280       # ptr
01a     cmpq    R10, [R15 + #136 (32-bit)]      # raw ptr
021     jge,u   B15  P=0.000100 C=-1.000000
021
027   B2: #     B3 <- B1  Freq: 77.9922
027     movq    [R15 + #120 (8-bit)], R10       # ptr
02b     PREFETCHNTA [R10 + #256 (32-bit)]       # Prefetch to non-temporal cache for write
033     movq    [R8], 0x0000000000000001        # ptr
03a     PREFETCHNTA [R10 + #320 (32-bit)]       # Prefetch to non-temporal cache for write
042     movq    RDI, R8 # spill
045     addq    RDI, #24        # ptr
049     PREFETCHNTA [R10 + #384 (32-bit)]       # Prefetch to non-temporal cache for write
051     movl    RCX, #32        # long (unsigned 32-bit)
056     movq    R10, precise klass [B: 0x00002aaaab076708:Constant:exact *      # ptr
060     movq    [R8 + #8 (8-bit)], R10  # ptr
064     movl    [R8 + #16 (8-bit)], #256        # int
06c     xorl    rax, rax        # ClearArray:
        rep stosq       # Store rax to *rdi++ while rcx--
071
071   B3: #     B4 <- B16 B2  Freq: 78
071  
071     # checkcastPP of R8
071     xorl    R10, R10        # int
074     movl    R9, #256        # int
        nop     # 2 bytes pad for loops and calls

07c   B4: #     B17 B5 <- B3 B5         Loop: B4-B5 inner stride: not constant pre of N153 Freq: 19850.2
07c     cmpl    R10, #256       # unsigned
083     jge,u   B17  P=0.000001 C=-1.000000
083
089   B5: #     B4 B6 <- B4  Freq: 19850.2
089     movslq  R11, R10        # i2l
08c     movb    [R8 + #24 + R11], #0    # byte
092     incl    R10     # int
095     cmpl    R10, #8
099     jlt,s   B4  P=0.996072 C=22313.000000
099
09b   B6: #     B11 B7 <- B5  Freq: 77.9799
09b     subl    R9, R10 # int
09e     andl    R9, #-16        # int
0a2     addl    R9, R10 # int
0a5     cmpl    R10, R9
0a8     jge,s   B11  P=0.500000 C=-1.000000
0a8
0aa   B7: #     B8 <- B6  Freq: 38.9899
0aa     PXOR  XMM0,XMM0 ! replicate8B
        nop     # 2 bytes pad for loops and calls

0b0   B8: #     B10 B9 <- B7 B9         Loop: B8-B9 inner stride: not constant main of N85 Freq: 9925.09
0b0     movslq  R11, R10        # i2l
0b3     MOVQ  [R8 + #24 + R11],XMM0     ! packed8B
0ba     movl    R11, R10        # spill
0bd     addl    R11, #16        # int
0c1     movslq  R10, R10        # i2l
0c4     MOVQ  [R8 + #32 + R10],XMM0     ! packed8B
0cb     cmpl    R11, R9
0ce     jge,s   B10  P=0.003928 C=22313.000000
0ce
0d0   B9: #     B8 <- B8  Freq: 9886.1
0d0     movl    R10, R11        # spill
0d3     jmp,s   B8
0d3
0d5   B10: #    B11 <- B8  Freq: 38.9899
0d5     movl    R10, R11        # spill
0d5
0d8   B11: #    B14 B12 <- B6 B10  Freq: 77.9799
0d8     cmpl    R10, #256
0df     jge,s   B14  P=0.500000 C=-1.000000
        nop     # 3 bytes pad for loops and calls

0e4   B12: #    B17 B13 <- B11 B13      Loop: B12-B13 inner stride: not constant post of N153 Freq: 9922.54
0e4     cmpl    R10, #256       # unsigned
0eb     jge,us  B17  P=0.000001 C=-1.000000
0eb
0ed   B13: #    B12 B14 <- B12  Freq: 9922.53
0ed     movslq  R11, R10        # i2l
0f0     movb    [R8 + #24 + R11], #0    # byte
0f6     incl    R10     # int
0f9     cmpl    R10, #256
100     jlt,s   B12  P=0.996072 C=22313.000000
100
102   B14: #    N1 <- B13 B11  Freq: 77.9698
102     movq    RAX, R8 # spill
105     addq    rsp, 80 # Destroy frame
        popq    rbp
        testl   rax, [rip + #offset_to_poll_page]       # Safepoint: poll for GC

110     ret
110
111   B15: #    B18 B16 <- B1  Freq: 0.00780129
111     movq    RSI, precise klass [B: 0x00002aaaab076708:Constant:exact *      # ptr
11b     movl    RDX, #256       # int
120     nop     # 3 bytes pad for loops and calls
123     call,static  wrapper for: _new_array_Java
        # Main::foo @ bci:3  L[0]=_ L[1]=_
        #
128
128   B16: #    B3 <- B15  Freq: 0.00780114
        # Block is sole successor of call
128     movq    R8, RAX # spill
12b     jmp     B3
12b
130   B17: #    N1 <- B12 B4  Freq: 1e-06
130     movl    RSI, #-28       # int
135     movq    RBP, R8 # spill
138     movl    [rsp + #0], R10 # spill
13c     nop     # 3 bytes pad for loops and calls
13f     call,static  wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant')
        # Main::foo @ bci:17  L[0]=RBP L[1]=rsp + #0 STK[0]=RBP STK[1]=rsp + #0 STK[2]=#0
        # AllocatedObj(0x0000000040c31880)

144     int3    # ShouldNotReachHere
144
151   B18: #    N1 <- B15  Freq: 7.80129e-08
151     # exception oop is in rax; no code emitted
151     movq    RSI, RAX        # spill
154     addq    rsp, 80 # Destroy frame
        popq    rbp

159     jmp     rethrow_stub

Just to recap, R8-R15 are the additional general-purpose 64-bit registers introduced with amd64.

The first part (00c-027) is allocating an array, and this is already interesting. As the comment indicates, R15 is apparently used as a pointer to a thread-local storage of the current thread, and R15[120] is the pointer to the head of the heap sub-space dedicated for this thread.

So the byte[] is allocated from this thread-local space by simply reserving 256+24 = 280 bytes (see the addq at 013). If there's not enough space (the limit is stored at R15[136]), it falls back to the slower allocation code at B15; that code presumably reserves a new chunk from the eden space and allocates the new object there.
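To make the fast path concrete, here is a minimal sketch of the same bump-the-pointer logic in plain Java. It is a toy model of what 00c-027 appears to do, not HotSpot code; the class name `TlabSketch` and its fields are made up for illustration, and -1 simply stands in for "take the slow path at B15".

```java
// Toy model of the allocation fast path at 00c-027 (hypothetical names, not HotSpot code).
final class TlabSketch {
    long top;        // plays the role of [R15 + #120]
    final long end;  // plays the role of [R15 + #136]

    TlabSketch(long start, long end) {
        this.top = start;
        this.end = end;
    }

    /** Returns the address of the new object, or -1 where the listing would jump to B15. */
    long allocate(long sizeInBytes) {
        long oldTop = top;                   // movq R8, [R15 + #120]
        long newTop = oldTop + sizeInBytes;  // addq R10, #280
        if (newTop >= end) {                 // cmpq R10, [R15 + #136] / jge,u B15
            return -1;
        }
        top = newTop;                        // movq [R15 + #120], R10
        return oldTop;
    }

    public static void main(String[] args) {
        TlabSketch tlab = new TlabSketch(0, 1024);
        System.out.println(tlab.allocate(24 + 256)); // prints 0
        System.out.println(tlab.allocate(24 + 256)); // prints 280
    }
}
```

The point of the sketch is that the common case costs one load, one add, one compare and one store, with no locking, because the buffer belongs to a single thread.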

Once the pointer to the new array is loaded into R8 at 00c, the initialization follows (033-071). The first 24 bytes of the newly allocated space are used for metadata: the first 8 bytes are probably lock- or GC-related, followed by a pointer to the class object, then another 8 bytes that hold the array length (the store at 064 writes it as a 4-byte int). The instruction at 06c zero-clears the array. In theory the zero-clear shouldn't have been necessary, as we then fill the array with zeros again, but the JIT failed to take advantage of that.

But note that the zero-clear is done 8 bytes at a time (rep stosq with RCX set to 32), so it did recognize that the array size is a multiple of 8.
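The numbers in the listing can be reproduced with a little arithmetic. The sketch below assumes what the listing suggests for this particular 64-bit JVM: a 24-byte array header and object sizes rounded up to a multiple of 8. Both are assumptions read off the dump, and the class and method names are made up.

```java
// Size arithmetic for arrays, with the header size read off the listing (an assumption).
final class ArrayLayoutSketch {
    static final int HEADER_BYTES = 24;

    /** Total bytes reserved for an array, rounded up to a multiple of 8. */
    static long arrayBytes(int length, int elementBytes) {
        long raw = HEADER_BYTES + (long) length * elementBytes;
        return (raw + 7) & ~7L;
    }

    /** Number of 8-byte words that "rep stosq" has to clear after the header. */
    static long clearQwords(int length, int elementBytes) {
        return (arrayBytes(length, elementBytes) - HEADER_BYTES) / 8;
    }

    public static void main(String[] args) {
        System.out.println(arrayBytes(256, 1));  // prints 280
        System.out.println(clearQwords(256, 1)); // prints 32
    }
}
```

For byte[256] this gives 280 bytes and 32 quadwords, matching `addq R10, #280` and `movl RCX, #32`. With 8-byte references, an Object[10] comes out at 104 bytes and 10 quadwords, which matches the `addq R10, #104` and `movl RCX, #10` in the Vector listing further down.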

I don't quite understand what those prefetch instructions (at 02b, 03a, and 049) are meant for. Presumably they are to make sure that the next time an object allocation happens, that part of the memory is in cache, but why 256, 320, and 384? Does anyone have a clue?
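For what it's worth, the addresses themselves are easy to work out. The sketch below assumes that R10 holds the new top-of-buffer pointer when the prefetches run (it is the value that 027 stores back to R15[120]); the class and method names are made up for illustration.

```java
// Address arithmetic for the three PREFETCHNTA instructions (hypothetical helper).
final class PrefetchSketch {
    /** Addresses touched by the prefetches, given the old top pointer and the object size. */
    static long[] prefetchAddresses(long oldTop, long sizeInBytes) {
        long newTop = oldTop + sizeInBytes; // value of R10 after the addq at 013
        return new long[] { newTop + 256, newTop + 320, newTop + 384 };
    }

    public static void main(String[] args) {
        for (long address : prefetchAddresses(0, 280)) {
            System.out.println(address); // prints 536, 600, 664
        }
    }
}
```

So all three targets lie past the end of the array that was just allocated, and they are spaced 64 bytes apart.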

Now as of 074, R8 is the pointer to 'buf' and R9 is the length of the array. Note that JIT knows that buf.length is always 256 here, so this is movl R9,256 and not movl R9,[R8+16]. Also note that this computation is outside the for loop. So this tells us that there's no need to explicitly assign the array length to a temporary variable in a tight loop, because JIT does the equivalent anyway:

>int len = buf.length;
for( int i=0; i<len; i++ )
  ...

Similarly there's no need to reverse the direction of the loop to avoid buf.length computation.
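To spell out the three styles being compared, here is a small sketch; the class and method names are made up. All three compute the same result. The listing above only demonstrates the hoisting for this one case, where the array length is a known constant, so treat any claim about other JVMs or other loops as an assumption.

```java
// Three ways to write the same loop; only the first is needed for readability.
final class LengthHoisting {
    // Plain version: reads buf.length in the loop condition.
    static int sumPlain(byte[] buf) {
        int sum = 0;
        for (int i = 0; i < buf.length; i++)
            sum += buf[i];
        return sum;
    }

    // Hand-hoisted version: copies the length into a local first.
    static int sumHoisted(byte[] buf) {
        int len = buf.length;
        int sum = 0;
        for (int i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    // Reversed version: counts down so the length is read only once.
    static int sumReversed(byte[] buf) {
        int sum = 0;
        for (int i = buf.length - 1; i >= 0; i--)
            sum += buf[i];
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4};
        System.out.println(sumPlain(data));    // prints 10
        System.out.println(sumHoisted(data));  // prints 10
        System.out.println(sumReversed(data)); // prints 10
    }
}
```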

The way the loop is compiled is very interesting. First there's the 'warm up' part (07c-099) that fills the array one byte at a time until the index reaches 8, then the 'fast loop' portion (09b-0d3) that zero-fills 16 bytes per iteration with two 8-byte stores from XMM0 (an SSE register), then the final 'cool down' part (0d5-100) that handles the remaining bytes that don't fill a whole 16-byte chunk. In this case, in theory it could have figured out that the whole 256-byte array divides evenly into such chunks, so the warm up and cool down were unnecessary, but it appears that the JIT didn't realize this.

I don't know what kind of computation happens behind the scenes here, but overall this loop unrolling is rather clever. The original code was a byte-by-byte assignment of 0, but in the final code, one iteration of the fast loop clears 16 bytes.
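Here is my reading of the three loops written back out as Java, as a rough sketch rather than anything the JIT actually works from. The names are made up, and the constants 8 and 16 are taken from the listing (`cmpl R10, #8`, `andl R9, #-16`, `addl R11, #16`). The method returns how many iterations each phase ran so the split is visible.

```java
// Hand-written model of the pre/main/post loops in the listing (hypothetical names).
final class SplitLoopSketch {
    /** Zero-fills buf and returns the iteration counts as {pre, main, post}. */
    static int[] zeroFill(byte[] buf) {
        int len = buf.length;
        int i = 0, pre = 0, main = 0, post = 0;

        // Pre-loop (07c-099): one byte per iteration until i reaches 8.
        while (i < len && i < 8) {
            buf[i++] = 0;
            pre++;
        }

        // Main-loop limit (09b-0a2): remaining length rounded down to a multiple of 16.
        int limit = ((len - i) & -16) + i;

        // Main loop (0b0-0d3): 16 bytes per iteration; the JIT does this with two 8-byte stores.
        while (i < limit) {
            java.util.Arrays.fill(buf, i, i + 16, (byte) 0);
            i += 16;
            main++;
        }

        // Post-loop (0e4-100): whatever is left, one byte per iteration.
        while (i < len) {
            buf[i++] = 0;
            post++;
        }
        return new int[] { pre, main, post };
    }

    public static void main(String[] args) {
        int[] counts = zeroFill(new byte[256]);
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]); // prints 8 15 8
    }
}
```

For the 256-byte array this works out to 8 single-byte stores, 15 fast iterations covering 240 bytes, and 8 more single-byte stores at the end.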

I also noticed that there's no array boundary check in the fast loop portion, which is nice.

OK, most of you have hopefully heard that in JDK6 they do lock coarsening and lock elision. So let's see that in action.

For that, I compiled the following code and executed it in the same fashion:

>    private static void foo() {
        Vector v = new Vector();
        v.add("abc");
        v.add("def");
        v.add("ghi");
    }

This gives me the following:

>000   B1: #     B10 B2 <- BLOCK HEAD IS JUNK   Freq: 20168
000     # stack bang
        pushq   rbp
        subq    rsp, #80        # Create frame
00c     # TLS is in R15
00c     movq    RAX, [R15 + #120 (8-bit)]       # ptr
010     movq    R10, RAX        # spill
013     addq    R10, #40        # ptr
017     # TLS is in R15
017     cmpq    R10, [R15 + #136 (32-bit)]      # raw ptr
01e     jge,u   B10  P=0.000100 C=-1.000000
01e
024   B2: #     B3 <- B1  Freq: 20166
024     # TLS is in R15
024     movq    [R15 + #120 (8-bit)], R10       # ptr
028     PREFETCHNTA [R10 + #256 (32-bit)]       # Prefetch to non-temporal cache for write
030     movq    R10, precise klass java/util/Vector: 0x00002aaaf2649f38:Constant:exact *        # ptr
03a     movq    R11, [R10 + #176 (32-bit)]      # ptr
041     movq    [RAX], R11      # ptr
044     movq    [RAX + #8 (8-bit)], R10 # ptr
048     movq    [RAX + #16 (8-bit)], #0 # long
050     movq    [RAX + #24 (8-bit)], #0 # long
058     movq    [RAX + #32 (8-bit)], #0 # long
058
060   B3: #     B12 B4 <- B11 B2  Freq: 20168
060    
060     movq    RBP, RAX        # spill
063     # checkcastPP of RBP
063     # TLS is in R15
063     movq    R11, [R15 + #120 (8-bit)]       # ptr
067     movq    R10, R11        # spill
06a     addq    R10, #104       # ptr
06e     # TLS is in R15
06e     cmpq    R10, [R15 + #136 (32-bit)]      # raw ptr
075     jge,u   B12  P=0.000100 C=-1.000000
075
07b   B4: #     B5 <- B3  Freq: 20166
07b     # TLS is in R15
07b     movq    [R15 + #120 (8-bit)], R10       # ptr
07f     PREFETCHNTA [R10 + #256 (32-bit)]       # Prefetch to non-temporal cache for write
087     movq    [R11], 0x0000000000000001       # ptr
08e     PREFETCHNTA [R10 + #320 (32-bit)]       # Prefetch to non-temporal cache for write
096     movq    RDI, R11        # spill
099     addq    RDI, #24        # ptr
09d     PREFETCHNTA [R10 + #384 (32-bit)]       # Prefetch to non-temporal cache for write
0a5     movq    R10, precise klass [Ljava/lang/Object;: 0x00002aaaf264e928:Constant:exact *     # ptr
0af     movq    [R11 + #8 (8-bit)], R10 # ptr
0b3     movl    [R11 + #16 (8-bit)], #10        # int
0bb     movl    RCX, #10        # long (unsigned 32-bit)
0c0     xorl    rax, rax        # ClearArray:
        rep stosq       # Store rax to *rdi++ while rcx--
0c5
0c5   B5: #     B16 B6 <- B13 B4  Freq: 20168
0c5    
0c5     # checkcastPP of R11
0c5     movq    [RBP + #32 (8-bit)], R11        # ptr ! Field java/util/Vector.elementData
0c9     movq    R10, RBP        # ptr -> long
0cc     shrq    R10, #9
0d0     movq    RDX, java/lang/String:exact *   # ptr
0da     movq    R11, 0x00002a959c9da580 # ptr
0e4     movb    [R11 + R10], #0 # byte
0e9     movq    RSI, RBP        # spill
0ec     nop     # 3 bytes pad for loops and calls
0ef     call,static  java.util.Vector::add
        # Main::foo @ bci:11  L[0]=RBP
        # AllocatedObj(0x0000000040b30680)

0f4
0f4   B6: #     B15 B7 <- B5  Freq: 20167.6
        # Block is sole successor of call
0f4     movq    RDX, java/lang/String:exact *   # ptr
0fe     movq    RSI, RBP        # spill
101     nop     # 2 bytes pad for loops and calls
103     call,static  java.util.Vector::add
        # Main::foo @ bci:18  L[0]=RBP
        # AllocatedObj(0x0000000040b30680)

108
108   B7: #     B14 B8 <- B6  Freq: 20167.2
        # Block is sole successor of call
108     movq    RDX, java/lang/String:exact *   # ptr
112     movq    RSI, RBP        # spill
115     nop     # 2 bytes pad for loops and calls
117     call,static  java.util.Vector::add
        # Main::foo @ bci:25  L[0]=_
        #
11c
11c   B8: #     N1 <- B7  Freq: 20166.8
        # Block is sole successor of call
11c     addq    rsp, 80 # Destroy frame
        popq    rbp
        testl   rax, [rip + #offset_to_poll_page]       # Safepoint: poll for GC
       
127     ret

(slow path omitted)

The allocation of a Vector object (00c-058) is almost identical to the array allocation code we've seen before (except the additional field initializations at 048-058.) The array allocation for Vector.elementData follows (060-0C0.)

Note that the Vector constructors are defined in highly nested fashion like this:

>    public Vector(int initialCapacity, int capacityIncrement) {
        super();
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal Capacity: "+
                                               initialCapacity);
        this.elementData = new Object[initialCapacity];
        this.capacityIncrement = capacityIncrement;
    }

    public Vector(int initialCapacity) {
        this(initialCapacity, 0);
    }

    public Vector() {
        this(10);
    }

... but the whole thing is inlined, so the end result is just as fast as the following code. This is great.

>    public Vector() {
        this.elementData = new Object[10];
        this.capacityIncrement = 0;
    }

But wait, after that, you see that there are three call instructions for Vector.add. So there's neither lock elision nor lock coarsening, despite the fact that this Vector object never escapes the stack.

I thought perhaps that's because Vector.add is too complex to be inlined, so I tried the following code, in the hope of seeing the lock elision:

>    private static void foo() {
        Foo foo = new Foo();
        foo.inc();
        foo.inc();
        foo.inc();
    }

    private static final class Foo {
        int i=0;

        public synchronized void inc() {
            i++;
        }
    }

This produced the following code:

>000   B1: #     B6 B2 <- BLOCK HEAD IS JUNK   Freq: 19972
000     # stack bang
        pushq   rbp
        subq    rsp, #80        # Create frame
00c     # TLS is in R15
00c     movq    RBP, [R15 + #120 (8-bit)]       # ptr
010     movq    R10, RBP        # spill
013     addq    R10, #24        # ptr
017     cmpq    R10, [R15 + #136 (32-bit)]      # raw ptr
01e     jge,u   B6  P=0.000100 C=-1.000000
01e
024   B2: #     B3 <- B1  Freq: 19970
024     movq    [R15 + #120 (8-bit)], R10       # ptr
028     PREFETCHNTA [R10 + #256 (32-bit)]       # Prefetch to non-temporal cache for write
030     movq    R10, precise klass Main$Foo: 0x00002aaaf2646e58:Constant:exact *        # ptr
03a     movq    R11, [R10 + #176 (32-bit)]      # ptr
041     movq    [RBP], R11      # ptr
045     movq    [RBP + #8 (8-bit)], R10 # ptr
049     movq    [RBP + #16 (8-bit)], #0 # long
049
051   B3: #     B8 B4 <- B7 B2  Freq: 19972
051    
051     # checkcastPP of RBP
051     leaq    R11, [rsp + #64]        # box lock
056     fastlock RBP,R11,RAX,R10
135     jne     B8  P=0.000001 C=-1.000000
135
13b   B4: #     B9 B5 <- B8 B3  Freq: 19972
13b     MEMBAR-acquire (prior CMPXCHG in FastLock so empty encoding)
13b     movl    R11, [RBP + #16 (8-bit)]        # int ! Field Main$Foo.i
13f     incl    R11     # int
142     movl    [RBP + #16 (8-bit)], R11        # int ! Field Main$Foo.i
146     MEMBAR-release
146     MEMBAR-acquire
146     movl    R11, [RBP + #16 (8-bit)]        # int ! Field Main$Foo.i
14a     incl    R11     # int
14d     movl    [RBP + #16 (8-bit)], R11        # int ! Field Main$Foo.i
151     MEMBAR-release
151     MEMBAR-acquire
151     movl    R11, [RBP + #16 (8-bit)]        # int ! Field Main$Foo.i
155     incl    R11     # int
158     movl    [RBP + #16 (8-bit)], R11        # int ! Field Main$Foo.i
15c     MEMBAR-release (a FastUnlock follows so empty encoding)
15c     leaq    RAX, [rsp + #64]        # box lock
161     fastunlock RBP, RAX, R10
218     jne,s   B9  P=0.000001 C=-1.000000
218
21a   B5: #     N1 <- B9 B4  Freq: 19972
21a     addq    rsp, 80 # Destroy frame
        popq    rbp
        testl   rax, [rip + #offset_to_poll_page]       # Safepoint: poll for GC
       
225     ret

(slow path omitted)

We are all familiar with the memory allocation by now, so we can skip that.

The 'fastlock' pseudo-instruction (AFAIK there's no such operation in amd64, and a single machine instruction can't possibly occupy 223 bytes!) must be the lock code. Here you see that lock coarsening has indeed happened (yay!), and the three increments happen in a single block. MEMBAR-acquire/release must be more pseudo-instructions, which became no-ops in this scenario; note that the length of those instructions is 0.
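In source terms, the transformation looks roughly like the sketch below. This is only my reading of the listing expressed as Java; the class and method names are made up, and `coarsened()` is what the compiled code appears to behave like, not something the JIT literally generates from source.

```java
// Source-level picture of lock coarsening (hypothetical names).
final class CoarseningSketch {
    static final class Counter {
        int i = 0;

        synchronized void inc() {
            i++;
        }
    }

    // As written: three separate lock/unlock pairs.
    static int asWritten() {
        Counter c = new Counter();
        c.inc();
        c.inc();
        c.inc();
        return c.i;
    }

    // After inlining and coarsening: one lock/unlock pair around all three increments.
    static int coarsened() {
        Counter c = new Counter();
        synchronized (c) {
            c.i++; // each increment is still a separate load, incl and store
            c.i++;
            c.i++;
        }
        return c.i;
    }

    public static void main(String[] args) {
        System.out.println(asWritten()); // prints 3
        System.out.println(coarsened()); // prints 3
    }
}
```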

Note that JVM still fails to eliminate a lock here, despite the fact that this object doesn't escape the stack. I tried various things to see the effect of escape analysis and lock elision kick in, but couldn't find a way to do it. It looks like this feature is not quite in JDK yet, although it's equally possible that I'm doing something stupid.

Also note that, presumably because of the memory barriers associated with this, each increment writes back to memory. This is unfortunate because in theory the three increments could have been combined into one, given that the lock was coarsened.

Indeed if I remove the 'synchronized' keyword, I get the following substantially simpler version:

>000   B1: #     B4 B2 <- BLOCK HEAD IS JUNK   Freq: 27066
000     # stack bang
        pushq   rbp
        subq    rsp, #16        # Create frame
00c     # TLS is in R15
00c     movq    RAX, [R15 + #120 (8-bit)]       # ptr
010     movq    R10, RAX        # spill
013     addq    R10, #24        # ptr
017     cmpq    R10, [R15 + #136 (32-bit)]      # raw ptr
01e     jge,us  B4  P=0.000100 C=-1.000000
01e
020   B2: #     B3 <- B1  Freq: 27063.3
020     movq    [R15 + #120 (8-bit)], R10       # ptr
024     PREFETCHNTA [R10 + #256 (32-bit)]       # Prefetch to non-temporal cache for write
02c     movq    R10, precise klass Main$Foo: 0x00002aaaf25dfbc8:Constant:exact *        # ptr
036     movq    R11, [R10 + #176 (32-bit)]      # ptr
03d     movq    [RAX], R11      # ptr
040     movq    [RAX + #8 (8-bit)], R10 # ptr
040
044   B3: #     N1 <- B5 B2  Freq: 27066
044     movq    [RAX + #16 (8-bit)], #3 # long
04c    
04c     addq    rsp, 16 # Destroy frame
        popq    rbp
        testl   rax, [rip + #offset_to_poll_page]       # Safepoint: poll for GC
       
057     ret

(slow path omitted)

So not only the three inc()s but also the initializer got collapsed into a single "movq rax[16],3" instruction. Wow!

All in all, modern JVMs seem pretty good at generating optimal code. In various situations, the resulting assembly code is far from the straight-forward instruction-by-instruction translation. OTOH, the escape analysis doesn't really seem to do anything useful yet.

This was a long post, but I hope you enjoyed this as much as I did.


Comments

Are there any studies how combined client-server compiler would perform? E.g after first 1000 executions/loops compile the code with client compiler then after 9000 more executions/loops compile it with server compiler.

kohsuke, sorry the bad post - I must have missed/typo'ed the ending PRE tag, didn't use the preview... can you fix that?

opinali -- yes, but that's with Foo.inc() method. Where I was complaining, I was complaining about the lack of lock coarsening in Vector.add.

So I guess the best we can say is that the lock coarsening does work in some situations.

I'm also not a HotSpot expert, but it seems to me that lock coarsening is working for your foo() method. Your assembly listing shows that the lock/unlock operations happen only once:

051 leaq R11, [rsp + #64] # box lock
056 fastlock RBP,R11,RAX,R10
135 jne B8 P=0.000001 C=-1.000000
... three copies of the foo.inc() code ...
15c leaq RAX, [rsp + #64] # box lock
161 fastunlock RBP, RAX, R10
218 jne,s B9 P=0.000001 C=-1.000000

(There's also the out-of-line, slow locking code for inflated locks, at labels B8 and B9, that you don't show.) HotSpot is just not coarsening the memory barrier operations that are necessary to preserve JMM semantics. Apparently HotSpot's lock optimizations are not smart enough to risk messing with happens-before constraints. And because the barriers are not elided, HotSpot also can't perform additional optimizations like constant propagation.

mstanik -- sorry, that's just not true. You probably mean -XX:+PrintAssembly. As I wrote in this blog, all it takes for you to do this is a debug build of JDK.

Yep, I focused on the inc() example as you claimed that "JVM still fails to eliminate a lock here, despite the fact that this object doesn't escape the stack." You should be correct that the Vector.add() code is even worse because it's not inlined... this is an effect of callee-side locking. Suggestion: HotSpot could compile synchronized methods with a stub mechanism, i.e. a stub that does only the synchronization and calls the main code, so this main code could be invoked directly from callers that can optimize out the locking but cannot inline. This would have minimal cost for invocations into the synchronized stub (an extra CALL and RET), but would help optimize other interesting scenarios, like recursive calls and reentrant synchronization (when a given class has two synchronized methods a and b, and a invokes b, again without inlining).

Done.

Sorry I'm late to the party.
I'm the architect of the -server JIT.
If you have any lingering questions, please re-post them and I'll try to answer them.
As for the prefetch of 256+n*64 - that is so the new allocations hit in cache. Allocation memory is typically "very old" - not touched for a long time (probably since the last GC cycle) and hence isn't in any cache layer. The prefetch during *this* allocation means that the *next* allocation runs faster.
Also, I can print Azul assembly for those who are interested.
Cliff

cliffc -- thanks for the clarification. That's kind of what I suspected, but good to get a confirmation. I'm sure Azul assembly post (or in fact any insights into any JVM) would fascinate many folks, especially if it's coming from you.

mstanik — we're getting there; see http://wikis.sun.com/display/HotSpotInternals/PrintAssembly .

art_ -- thank you for the suggestion. I do have SPARC systems, but I think I'll leave the exercise to someone else.

opinali -- I did say the lock coarsening happened with Foo.inc, but it's still doing a lock once, which it shouldn't have done, if the escape analysis was working as I expected.

The double-call approach you mention is interesting; maybe you should have a JVM :-)

Just a small suggestion if you will: it'd be interesting to see the assembly code produced for SPARC if you have access to such a machine and the Sun JVM on SPARC supports the same options. A lot of universities teach both Java and SPARC or MIPS assembly programming. SPARC assembly is also, IMHO, easier to read than x86. SPARC output would make this article a useful resource for many teachers, professors, and students.

Also -XX:+PrintCompilation will tell you when something gets JIT'd. The default -client compiler does much less optimization, and compiles fairly quickly, but certainly not run-once code. The -server compiler only compiles things that have executed or looped 10000 or more times, but it compiles with a much higher level of optimization. Cliff

Awesome post! As for the locking optimizations, Jeroen Borgers has written a very nice article asking the question, do these lock optimizations really work. It will be published on InfoQ as soon as I finish editing it 8^)

All this optimization stuff is quite nice and shows that Sun engineers have packed a great deal of intelligence into the HotSpot compiler. But a question remains: couldn't we get still better optimizations if we used an ahead-of-time compiler? Or if HotSpot was able to save and re-use the compilation results between runs?

The non-temporal prefetches are indeed interesting. In the 256 array clearing example they are 64 bytes apart, which makes sense as that matches the width of a cache line. They are non-temporal, which is interesting as it means that when that line is evicted from the L1 data cache it will be sent directly to memory and not to L2. This means the compiler believes that that data will be written to in the near'ish future but not read from. The choice of memory locations is also puzzling as they would appear to be at the end of the array being allocated and beyond, unless I've got my math wrong?

genepi -- Isn't Java already supposed to run faster than C on many benchmarks? So in that sense, isn't that argument already pretty much decided? I do however wonder if there can be some ahead of the time compilation to improve the start up performance, possibly with later recompilation as HotSpot acquires more information.

OK, so now there are two of you. I'll try the latest version then. I did this experiment at home, so I can swear that I downloaded it from a public website.

kohsuke: Are you able to reproduce the same features in new builds? I'm having the same problems as above users. Running java 1.6.0_10-beta-fastdebug-b23 in a Windows 32bit / Cygwin environment. Got a bit discouraged after reading this: http://blogs.tedneward.com/2008/04/06/The+Complexities+Of+Black+Boxes.aspx

All I needed to do was "java -XX:+PrintOptoAssembly -server -cp . Main" as explained in the post. Are you sure you are running the right version of JVM, as opposed to the one in your PATH?

Yeah, I changed the JAVA_HOME, and then alias'ed "java" to "JAVA_HOME/bin/java" - and writing "java -version" gave me the correct version (I wrote it in the last comment). Think I'm missing something?

Great post! But how did you get it to work? I downloaded "1.6.0_10-beta-fastdebug-b22" and I don't see the output you see. Was there a change from b14 to b22 with the PrintOptoAssembly option?

All I get is a hotspot.log file in an XML format, which doesn't contain any assembly code..

Interesting, I vaguely remember seeing some benchmarks showing that StringBuffer had no overhead in comparison to StringBuilder when that option was enabled. I didn't try it myself though.

ijuma -- Yes, I tried "-XX:+DoEscapeAnalysis" and "-XX:+EliminateLocks" but I didn't notice any difference in the compiled code.

Hi, Nice post. Regarding escape analysis, I _think_ you have to enable it by adding: -XX:+DoEscapeAnalysis Regards, Ismael

cool++

Unfortunately, the crucial option (-XX:+PrintOptoAssembly) is available only to Sun guys. This option requires a disassembler library which is not available outside Sun (despite being based on GNU sources). For more than a year I have been regularly checking this issue and there is still no visible progress.

seems i can't post 'less than' chars here, so to continue above post: Changing i less than 100 to 100000 worked. Thanks for a nice blogpost.

Ah ok. Didn't know there were size/time limitations. When trying to run the first "trivial" example above nothing happened. But changing "i<100" to "i<100000" did the trick. Thanks for the nice article. :)

I just tried "1.6.0_10-beta-fastdebug-b23" on Linux amd64 and it did work. Note that the assembly dump is produced only when the code is JIT-compiled, so you do have to run the program for some duration to see any output.

So for example if you just do:

public class Main {
    public static void main(String[] args) {
        System.out.println("Hello");
    }
}

then you won't see anything. Try something like this instead:

public class Main {
    public static void main(String[] args) {
        for(int i=0;i<100000;i++)
            bar();
    }
    public static void bar() {
        for(int i=0;i<100000;i++)
            foo();
    }
    public static void foo() {}
}

Also, some people claimed that this option only works in the -server JVM, but that is not true. The only reason I used the -server switch is because I wanted to see what JVM is really capable of, and I've heard that there are several optimizations only available in the server JVM.