Deep dive into assembly code from Java
Posted by kohsuke on March 30, 2008 at 10:10 PM PDT
One of the things I learned at The Server Side Java Symposium 2008 was a command-line option to print out the assembly code that the JIT is producing. Since I've always been interested in seeing the final assembly code that gets produced from Java code, I decided to give it a test drive. First the disclaimers:
The option in question is only available in debug builds of JDKs. You can download one from here. The binary I tested is JDK6 u10 b14.
$ java -fullversion
java full version "1.6.0_10-beta-fastdebug-b14"
First, let's try something trivial:
I run this like "java -XX:+PrintOptoAssembly -server -cp . Main". The -XX:+PrintOptoAssembly is the magic option, and with this option I get the following, which shows the code of the "foo" method:
You see that the entire bar() function call and the loop were optimized away. So it must have inlined the bar() method, then unrolled the loop. Now to something more interesting:
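A minimal sketch of the kind of method discussed next, assuming a 256-byte array that is allocated and then cleared one element at a time. The class name `ArrayFill` and the driver loop in `main` are illustrative choices, not the exact listing that was compiled.

```java
class ArrayFill {
    // Allocates a fixed-size array and then clears it byte by byte.
    static byte[] foo() {
        byte[] buf = new byte[256];
        for (int i = 0; i < buf.length; i++)
            buf[i] = 0;
        return buf;
    }

    public static void main(String[] args) {
        // Call foo() many times so that the JIT decides to compile it.
        for (int i = 0; i < 100000; i++)
            foo();
    }
}
```

The explicit clearing loop is redundant in Java, since a new array is already zeroed, which is what makes the generated code interesting to look at.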
This produces the following code:
Just to recap, R8-R15 are additional general-purpose 64-bit registers new in amd64. The first part (00c-027) allocates the array, and this is already interesting. As the comment indicates, R15 is apparently used as a pointer to the thread-local storage of the current thread, and R15[120] is the pointer to the head of the heap sub-space dedicated to this thread. So the byte[] is allocated from this thread-local space by simply reserving 256+32 bytes. If there's not enough space (the limit is set at R15[136]), it falls back to the slower allocation code at B15; this code must involve reserving a new chunk from the eden space and allocating the new object there.
Once the pointer to the new array is set to R8 at 00c, the initialization follows (033-071). The first 24 bytes of the newly allocated space are used for metadata (the first 8 bytes are probably lock or GC-related, followed by a pointer to the class object, then another 8 bytes for the size of the array). 06c zero-clears the array. In theory the zero-clear shouldn't have been necessary, as we then fill the array with zeros again, but the JIT failed to take advantage of that. But note that the zero-clear is done 8 bytes at a time, so it did recognize that the array size is a multiple of 8.
I don't quite understand what those prefetch instructions (at 02b, 03a, and 049) are meant for. Presumably they are to make sure that the next time an object allocation happens, that part of the memory is in cache, but why 256, 320, and 384? Does anyone have a clue?
Now as of 074, R8 is the pointer to 'buf' and R9 is the length of the array. Note that the JIT knows that buf.length is always 256 here, so this is movl R9,256 and not movl R9,[R8+16]. Also note that this computation is outside the for loop. So this tells us that there's no need to explicitly assign the array length to a temporary variable in a tight loop, because JIT does the equivalent anyway:
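The two loop shapes in question look roughly like this. This is a sketch with hypothetical names (`LengthHoisting`, `fillPlain`, `fillHoisted`). Both methods compute the same result, and the observation above suggests the server JIT treats the length as loop-invariant in the plain form too, at least in the case examined here.

```java
class LengthHoisting {
    // Straightforward form: buf.length appears in the loop condition.
    static void fillPlain(byte[] buf) {
        for (int i = 0; i < buf.length; i++)
            buf[i] = 0;
    }

    // Hand-hoisted form: the length is copied into a local variable first.
    static void fillHoisted(byte[] buf) {
        int len = buf.length;
        for (int i = 0; i < len; i++)
            buf[i] = 0;
    }
}
```

Since the hand-hoisted form buys nothing in the compiled code examined here, the plain form is the easier one to read and maintain.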
Similarly there's no need to reverse the direction of the loop to avoid the buf.length computation.
The way the loop is compiled is very interesting. First there's the 'warm up' part (07c-099) that presumably fills the array until it reaches an 8-byte boundary, then the 'fast loop' portion (09b-0d3) that zero-fills 8 bytes per iteration by using an MMX register, then the final 'cool down' part (0d5-100) that handles the remainder that doesn't fill a whole 8-byte block. In this case, in theory it could have figured out that the whole thing fits 8-byte boundaries exactly, so the warm up and cool down were unnecessary, but it appears that the JIT didn't realize this. I don't know what kind of computation happens behind the scenes here, but overall this loop unrolling is rather clever. The original code was a byte-by-byte assignment of 0, but in the final code, one loop iteration clears 8 bytes at a time. I also noticed that there's no array bounds check in the fast loop portion, which is nice.
OK, most of you have hopefully heard that JDK6 does lock coarsening and lock elision. So let's see that in action. For that, I compiled the following code and executed it in the same fashion:
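A minimal sketch of such a test, assuming a method-local Vector that receives three add() calls. The class name `VectorLocks` is hypothetical, and returning the size is an addition made only so the result can be observed; it is itself one more synchronized call on top of the three adds.

```java
import java.util.Vector;

class VectorLocks {
    // The Vector never leaves this method, so each synchronized add()
    // takes a lock that no other thread can ever contend for.
    static int foo() {
        Vector<String> v = new Vector<String>();
        v.add("a");
        v.add("b");
        v.add("c");
        return v.size(); // assumption: added for observability only
    }
}
```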
This gives me the following:
The allocation of a Vector object (00c-058) is almost identical to the array allocation code we've seen before (except the additional field initializations at 048-058). The array allocation for Vector.elementData follows (060-0C0). Note that the Vector constructors are defined in a highly nested fashion like this:
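The chain can be sketched with a simplified stand-in class modelled on the java.util.Vector constructors. `MiniVector` is a hypothetical name used for illustration and this is not the JDK source; the defaults of 10 and 0 mirror what Vector's own no-argument and one-argument constructors pass along.

```java
class MiniVector {
    Object[] elementData;
    int capacityIncrement;

    // The constructor that does the real work.
    MiniVector(int initialCapacity, int capacityIncrement) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal Capacity: " + initialCapacity);
        this.elementData = new Object[initialCapacity];
        this.capacityIncrement = capacityIncrement;
    }

    // Delegates with a default capacity increment of 0.
    MiniVector(int initialCapacity) {
        this(initialCapacity, 0);
    }

    // Delegates once more with a default capacity of 10.
    MiniVector() {
        this(10);
    }
}
```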
... but the whole thing is inlined, so the end result is just as fast as the following code. This is great.
But wait, after that, you see that there are three call instructions for Vector.add. So there's neither lock elision nor lock coarsening, despite the fact that this Vector object never escapes the stack. I thought perhaps that's because Vector.add is too complex to be inlined, so I tried the following code, in the hope of seeing the lock elision:
This produced the following code:
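For reference, the Java side of this second experiment can be sketched as follows: a small class with a synchronized inc() method, called three times on a method-local instance. The field name `count`, the `get()` accessor, the `LockElisionDemo` wrapper and the returned value are assumptions made for illustration, not the exact listing.

```java
class Foo {
    private int count = 0;

    // Small enough to be inlined; each call locks and unlocks this object.
    synchronized void inc() {
        count++;
    }

    // Plain read, added only so the result can be checked.
    int get() {
        return count;
    }
}

class LockElisionDemo {
    static int foo() {
        Foo f = new Foo(); // never shared with another thread
        f.inc();
        f.inc();
        f.inc();
        return f.get();
    }
}
```

With lock coarsening, the three lock/unlock pairs can merge into one; with lock elision, the lock would disappear altogether because no other thread can ever see `f`.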
We are all familiar with the memory allocation by now, so we can skip that. The 'fastlock' pseudo-instruction (AFAIK there's no such operation in amd64, and a single machine instruction can't possibly occupy 223 bytes!) must be the lock code. Here you see that the lock coarsening has indeed happened (yay!), and three increments happen in a single block (MEMBAR-acquire/release must be another pseudo-instruction, which became a no-op in this scenario; see that the length of those instructions is 0).
Note that the JVM still fails to eliminate a lock here, despite the fact that this object doesn't escape the stack. I tried various things to see the effect of escape analysis and lock elision kick in, but couldn't find a way to do it. It looks like this feature is not quite in the JDK yet, although it's equally possible that I'm doing something stupid.
Also note that, presumably because of the memory barrier associated with this, each increment writes back to memory. This is unfortunate because in theory the three increments could have been combined into one, given that the lock was coarsened. Indeed if I remove the 'synchronized' keyword, I get the following substantially simpler version:
So not only the three inc()s but also the initializer got collapsed into a single "movq rax[16],3" instruction. Wow!
All in all, modern JVMs seem pretty good at generating optimal code. In various situations, the resulting assembly code is far from a straightforward instruction-by-instruction translation. OTOH, the escape analysis doesn't really seem to do anything useful yet. This was a long post, but I hope you enjoyed this as much as I did.
Comments
Comments are listed in date ascending order (oldest first)
Submitted by kohsuke on Wed, 2008-05-14 08:19.
I just tried "1.6.0_10-beta-fastdebug-b23" on Linux amd64 and it did work. Note that the assembly dump is produced only when the code is JIT-compiled, so you do have to run the program for some duration to see any output.
So for example if you just do:
public class Main {
    public static void main(String[] args) {
        System.out.println("Hello");
    }
}
then you won't see anything. Try something like this instead:
public class Main {
    public static void main(String[] args) {
        for (int i = 0; i < 100000; i++)
            bar();
    }
    public static void bar() {
        for (int i = 0; i < 100000; i++)
            foo();
    }
    public static void foo() {}
}
Also, some people claimed that this option only works in the -server JVM, but that is not true. The only reason I used the -server switch is because I wanted to see what JVM is really capable of, and I've heard that there are several optimizations only available in the server JVM.
Submitted by etni3s on Wed, 2008-05-14 16:54.
Ah ok. Didn't know there were size/time limitations. When trying to run the first "trivial" example above nothing happened. But changing "i<100" to "i<100000" did the trick.
Thanks for the nice article. :)
Submitted by etni3s on Wed, 2008-05-14 16:56.
seems i can't post 'less than' chars here, so to continue above post:
Changing i less than 100 to 100000 worked.
Thanks for a nice blogpost.
Submitted by mstanik on Mon, 2008-03-31 22:41.
Unfortunately, the crucial option (-XX:+PrintOptoAssembly) is available only to Sun guys. This option requires a disassembler library which is not available outside Sun (despite being based on GNU sources). For more than a year I have been regularly checking this issue and there is still no visible progress.
Submitted by ijuma on Mon, 2008-03-31 01:52.
Hi,
Nice post. Regarding escape analysis, I _think_ you have to enable it by adding:
-XX:+DoEscapeAnalysis
Regards,
Ismael
Submitted by kohsuke on Mon, 2008-03-31 08:26.
ijuma — Yes, I tried "-XX:+DoEscapeAnalysis" and "-XX:+EliminateLocks" but I didn't notice any difference in the compiled code.
Submitted by ijuma on Mon, 2008-03-31 08:39.
Interesting, I vaguely remember seeing some benchmarks showing that StringBuffer had no overhead in comparison to StringBuilder when that option was enabled. I didn't try it myself though.
Submitted by aviadbd on Sun, 2008-05-11 05:57.
Great post! But how did you get it to work? I downloaded "1.6.0_10-beta-fastdebug-b22" and I don't see the output you see. Was there a change from b14 to b22 with the PrintOptoAssembly option?
All I get is a hotspot.log file in an XML format, which doesn't contain any assembly code..
Submitted by aviadbd on Tue, 2008-05-13 08:43.
Yeah, I changed the JAVA_HOME, and then alias'ed "java" to "JAVA_HOME/bin/java" - and writing "java -version" gave me the correct version (I wrote it in the last comment).
Think I'm missing something?
Submitted by kohsuke on Mon, 2008-05-12 09:53.
All I needed to do was "java -XX:+PrintOptoAssembly -server -cp . Main" as explained in the post. Are you sure you are running the right version of JVM, as opposed to the one in your PATH?
Submitted by etni3s on Tue, 2008-05-13 16:48.
kohsuke: Are you able to reproduce the same features in new builds?
I'm having the same problems as above users. Running java 1.6.0_10-beta-fastdebug-b23 in a Windows 32bit / Cygwin environment.
Got a bit discouraged after reading this:
http://blogs.tedneward.com/2008/04/06/The+Complexities+Of+Black+Boxes.aspx
Submitted by kohsuke on Tue, 2008-05-13 16:54.
OK, so now there are two of you. I'll try the latest version then. I did this experiment at home, so I can swear that I downloaded it from a public website.
Submitted by kohsuke on Mon, 2008-04-07 15:57.
genepi — Isn't Java already supposed to run faster than C on many benchmarks? So in that sense, isn't that argument already pretty much decided? I do however wonder if there can be some ahead of the time compilation to improve the start up performance, possibly with later recompilation as HotSpot acquires more information.
Submitted by asymtote on Mon, 2008-04-07 16:25.
The non-temporal prefetches are indeed interesting. In the 256 array clearing example they are 64 bytes apart, which makes sense as that matches the width of a cache line. They are non-temporal, which is interesting as it means that when that line is evicted from the L1 data cache it will be sent directly to memory and not to L2. This means the compiler believes that that data will be written to in the near'ish future but not read from. The choice of memory locations is also puzzling as they would appear to be at the end of the array being allocated and beyond, unless I've got my math wrong?
Submitted by genepi on Mon, 2008-04-07 14:03.
All this optimization stuff is quite nice and shows that Sun engineers have packed a great deal of intelligence into the HotSpot compiler. But a question remains: couldn't we get still better optimizations if we used an ahead-of-time compiler? Or if HotSpot was able to save and re-use the compilation results between runs?
Submitted by kcpeppe on Mon, 2008-04-07 23:27.
Awesome post!
As for the locking optimizations, Jeroen Borgers has written a very nice article asking the question, do these lock optimizations really work. It will be published on InfoQ as soon as I finish editing it 8^)
Submitted by cliffc on Thu, 2008-05-15 19:50.
Also -XX:+PrintCompilation will tell you when something gets JIT'd.
The default -client compiler does much less optimization, and compiles fairly quickly, but certainly not run-once code. The -server compiler only compiles things that have executed or looped 10000 or more times, but it compiles with a much higher level of optimization.
Cliff
Submitted by art_ on Fri, 2008-04-04 11:45.
Just a small suggestion if you will: it'd be interesting to see the assembly code produced for SPARC if you have access to such a machine and the Sun JVM on SPARC supports the same options. A lot of universities teach both Java and SPARC or MIPS assembly programming. SPARC assembly is also, IMHO, easier to read than x86. SPARC output would make this article a useful resource for many teachers, professors, and students.
Submitted by kohsuke on Fri, 2008-04-04 13:56.
opinali — I did say the lock coarsening happened with Foo.inc, but it's still doing a lock once, which it shouldn't have done, if the escape analysis was working as I expected.
The double-call approach you mention is interesting; maybe you should have a JVM :-)
Submitted by kohsuke on Fri, 2008-04-04 13:58.
art_ — thank you for the suggestion. I do have SPARC systems, but I think I'll leave the exercise to someone else.
Submitted by jrose on Sat, 2008-04-05 20:29.
mstanik — we're getting there; see http://wikis.sun.com/display/HotSpotInternals/PrintAssembly .
Submitted by kohsuke on Thu, 2008-05-01 09:28.
cliffc — thanks for the clarification. That's kind of what I suspected, but good to get a confirmation. I'm sure Azul assembly post (or in fact any insights into any JVM) would fascinate many folks, especially if it's coming from you.
Submitted by cliffc on Thu, 2008-05-01 08:20.
Sorry I'm late to the party.
I'm the architect of the -server JIT. If you have any lingering questions, please re-post them and I'll try to answer them. As for the prefetch of 256+n*64 - that is so the new allocations hit in cache. Allocation memory is typically "very old" - not touched for a long time (probably since the last GC cycle) and hence isn't in any cache layer. The prefetch during *this* allocation means that the *next* allocation runs faster. Also, I can print Azul assembly for those who are interested. Cliff
Submitted by opinali on Tue, 2008-04-01 18:12.
Yep, I focused on the inc() example as you claimed that "JVM still fails to eliminate a lock here, despite the fact that this object doesn't escape the stack." You should be correct that the Vector.add() code is even worse because it's not inlined... this is an effect of callee-side locking. Suggestion: HotSpot could compile synchronized methods with a stub mechanism, i.e. a stub that does only the synchronization and calls the main code, so this main code could be invoked directly from callers that can optimize out the locking but cannot inline. This would have minimal cost for invocations into the synchronized stub (an extra CALL and RET), but would help optimize other interesting scenarios, like recursive calls and reentrant synchronization (when a given class has two synchronized methods a and b, and a invokes b, again without inlining).
Submitted by kohsuke on Tue, 2008-04-01 07:09.
mstanik — sorry, that's just not true. You probably mean -XX:+PrintAssembly. As I wrote in this blog, all it takes for you to do this is a debug build of JDK.
Submitted by opinali on Tue, 2008-04-01 12:05.
I'm also not a HotSpot expert, but it seems to me that lock coarsening is working for your foo() method. Your assembly listing shows that the lock/unlock operations happen only once:
051 leaq R11, [rsp + #64] # box lock
056 fastlock RBP,R11,RAX,R10
135 jne B8 P=0.000001 C=-1.000000
... three copies of the foo.inc() code...
15c leaq RAX, [rsp + #64] # box lock
161 fastunlock RBP, RAX, R10
218 jne,s B9 P=0.000001 C=-1.000000
(There's also the out-of-line, slow locking code for inflated locks, at labels B8 and B9, that you don't show.) HotSpot is just not coarsening the memory barrier operations that are necessary to preserve JMM semantics. Apparently HotSpot's lock optimizations are not sufficiently smart to risk messing with happens-before constraints. And because the barriers are not elided, HotSpot can't also perform additional opts like constant propagation etc.
Submitted by kohsuke on Tue, 2008-04-01 12:11.
opinali — yes, but that's with Foo.inc() method. Where I was complaining, I was complaining about the lack of lock coarsening in Vector.add.
So I guess the best we can say is that the lock coarsening does work in some situations.
Submitted by opinali on Tue, 2008-04-01 12:13.
kohsuke, sorry the bad post - I must have missed/typo'ed the ending PRE tag, didn't use the preview... can you fix that?