|
|
||||||||||||||||
Mark Lam's BlogWhen is Software faster than Hardware?Posted by mlam on February 13, 2007 at 02:44 AM | Comments (6)I decided that I'll take a break from the bug fix track that I've been on, and have a little diversion to spice things up. I'll resume the bug fix (and JIT internals) discussion soon. For today, I would like to clarify a common misconception that hardware Java processors are faster than dynamic adaptive compilers / just-in-time compilers (i.e. JITs). I'll take you through some analysis to prove my point. The analysis will be based on examples from the phoneME Advanced VM for CDC (aka CVM), but this reasoning should apply to other VMs as well. Let's dive in ... Hardware Acceleration Another reason is that the hardware accelerators can provided special instructions that can do work that is traditionally done by software routines. Of course, these special instructions are specific to the types of algorithm (i.e. graphics, sound, DSP) that uses them. Hence, if your application doesn't do much graphics, sound, and/or DSP, then such hardware accelerators won't be able to make your application run any faster. Due to the known success of these hardware accelerators in their respective applications, we have come to generalize this success to think that all hardware acceleration will beat software solutions. In the case of Java processors in comparison to JITs, this generalization turns out to be untrue. Java Processors Disclaimer: I am not commenting on the quality of any specific hardware Java processor implementations in the market, but merely looking at this issue from a purely theoretical viewpoint. OK, now let's look at a specific example ... a Simple Example
a = a + b; // Add 2 integers.
When compiled into Java bytecode, this fragment will look like this:
iload_0 // load a onto Java stack.
iload_1 // load b onto Java stack.
iadd // pop the values, add them,
// and push the result back
// onto the stack.
istore_0 // store result into a.
Note: I am applying some simplifying assumptions in this case. These assumptions are that the integer variables a and b reside in the Java stack frame's local 0 and 1 slots respectively. I chose this simplification for 2 reasons:
the VM Interpreter
// opc_iload_0:
ldr r4, [fp, #-140] // get address of start of locals
ldrb r3, [r8, #1]! // compute next pc part 1
ldr r2, [r4, #0] // load local 0
ldr r1, [r9, r3, asl #2] // compute next pc part 2
str r2, [rJSP], #4 // push value on stack
mov pc, r1 // branch to next pc
...
// opc_iadd:
sub r4, rJSP, #8 // Compute stack location of args
ldmia r4, {r2, r4} // load args i.e. loads 2 words from memory
add r2, r2, r4 // do add
ldr r1, [r9, r3, asl #2] // compute next pc
str r2, [rJSP, #-8] // store result on stack
sub rJSP, rJSP, #4 // adjust stack pointer
mov pc, r1 // branch to next pc
...
// opc_istore_0:
ldrb r3, [r8, #1]! // compute next pc part 1
ldr r2, [rJSP, #-4]! // pop value off stack
ldr r4, [fp, #-140] // get address of start of locals
ldr r1, [r9, r3, asl #2] // compute next pc part 2
str r2, [r4, #0] // store into local 0
mov pc, r1 // branch to next pc
...
The above is attained by compiling CVM's interpreter loop (executejava_standard.c) into ARM assembly using gcc. I picked ARM only as an example instruction set. My intent is to have a real world example that we can do some timing analysis on. However, I replaced some of the register names with symbolic equivalent to make the assembly code easier to understand. rJSP is the register that contains the Java stack pointer. In the above, I did not show the interpreter fragment for the iload_1 bytecode because it is essentially identical to that of the iload_0 bytecode except for a different offset being applied to the frame pointer (fp) to compute r4. In the above, I also indicated that some instructions are computing the "next pc". By "next pc", I mean the instruction address of the interpreter fragment that is needed to execute the next bytecode. Hence, the comment "branch to next pc" actually means "go execute next bytecode". Therefore, in order to execute the a = a + b statement above using the VM interpreter:
For memory accesses, I only counted the load and store instructions, and discounted the branches. The ldmia counts as 2 loads because it loads 2 words of memory. Based on the StrongARM 1110 processor, under optimal conditions where all accessed memory is already in cache, each of the instructions will take at least one machine cycle (if I remember correctly). If there is a cache miss, memory accesses can take up to 4 machine cycles assuming no virtual memory paging stalls. Otherwise, it'll take a lot more time. Note: the ldmia instruction takes 2 cycles because it counts as 2 loads. So, under ideal conditions,
For contrast, let's say all the memory accesses are taking 4 machine cycles to complete. In this less ideal case,
the Hardware Java Processor For a JPU, there is no interpreter loop. Hence, there is no "next pc" to compute. The next pc is the next bytecode instruction which is pointed to by the JPU program counter register. I'm also assuming (to the JPU's advantage) that it has a dedicated register which points to the start of the current Java stack frame's locals. With this, the work to get the address of the locals can be eliminated. What remains are the essential operations that a JPU must still execute (even if it is behind the scenes). In the interpreter fragments above, these essential operations are expressed by the bolded lines of asm code. The numbers for the JPU are:
Note: the JPU doesn't translate the bytecode into equivalent ARM instructions before executing them. But internally, the JPU still needs to do some equivalent quantity of work. I'm using the ARM instruction as a realistic approximation of the work that is done by the JPU in hardware. Note also that the ldmia instruction is effectively equivalent to 2 load instructions. Please take this into account if you want to check my math above. the JIT
ldr r7, [rJFP, #-140] // load local 0
ldr r8, [rJFP, #-136] // load local 1
add r7, r7, r8 // do add
str r7, [rJFP, #-140] // store result into local 0
rJFP is a register dedicated by the JIT to contain the Java frame pointer. In practice, the last store instruction may be eliminated depending on what bytecodes comes after it. Also sometimes, the values of local 0 and 1 may already be in registers depending on what bytecodes preceed this code fragment. In that case, the 2 load instructions may be eliminated as well. Therefore, under ideal JIT conditions, the a = a + b gets compiled by the JIT into a single ARM add instruction. But to be conservative in favor of the JPU case, I'll include the load stores for our comparison. Hence, the JIT numbers are:
For the ideal JIT case where we get to compile the statement into only a single ARM add instruction:
Comparing the Numbers
The x numbers in parentheses indicate the speed-up factor over interpreter speed. Based on this trivial example, the JIT out-performs the JPU by a good margin. The reason for this is because the JIT is able to eliminate some or all the memory accesses by keeping the working data in registers. In contrast, the JPU is constraint by the VM specification's stack machine architecture which incurs a lot of memory accesses. But before you go crazy over these numbers, let me inject some sanity. In real life, you will probably not see the theoretical 26x or 83x speed-up for the ideal JIT case above. This is because the JIT won't always be able to keep all working data in registers because there is only a finite set of registers. This limitation is called register pressure. It is possible sometimes, but not in all cases. Hence, the real world numbers for the JIT are a lot closer to the non-ideal case than the ideal case. To be fair to the JPU, to get around the stack architecture problem, a JPU would probably implement a stack caching scheme where some registers will mirror some top N values on the Java operand stack. It will also have some registers mirror some locals. This effectively allows the JPU to avoid memory accesses just like the ideal JIT case, thereby allowing it to theoretically attain speeds approaching the ideal JIT case. However, register pressure also limits the JPU just as it limits the JIT. In addition, the JPU can only mirror the top most operands on the stack, and maybe the first few locals. Which stack operand and/or locals get mirrored in registers usually isn't configurable at runtime. The JIT however can deal with this dynamically, and can arbitrarily choose which arguments and/or locals to keep in registers. Hence, the JIT has a higher probability of reducing memory accesses than the JPU. Complex Bytecodes The VM instruction set also consists of bytecodes like new (for allocating objects), newarray (for allocating arrays), monitorenter (for locking an object), monitorexit (for unlocking an object), and field accessors like getfield, getstatic, putfield, putstatic, and many other complex bytecodes. I call these bytecodes "complex" because their execution cannot be done in hardware. For example, new needs to work correctly in harmony with the VM's garbage collector; monitorenter needs to work in harmony with the VM's synchronization and threading mechanisms, etc. In the case of the field accessors, there bytecodes have to interface with the class' constant pool table which can be implemented differently depending on the software VM implementation. The constant pool entries will also need to be resolved (amongst other things) before the fields can be accessed. The only way the JPU can support these bytecodes is to trap / call out to software handler routines. Note that in a Java program, the use of these bytecodes are actually quite common. This means that when executing real world programs, the JPU won't be able to realize its optimal speed-up potential. Real world programs doesn't just sit in tight loops doing simple arithmetic and logic operations (which is when the JPU gets to perform its best). Some JPUs will have support for quickened bytecodes that may accelerate the field accessors once all the difficult stuff has been done by software. However, there are some issues with how effective hardware quickened bytecodes can be. But let's assume that the JPU has all the ideal parts that can allow it to get close to the JIT's performance. After all, the JIT also needs to do allocation and synchronization in software the same way the JPU does, right? Technically, yes. However, the JIT may have a slight advantage where it can make use of special knowledge about how the VM components (e.g. garbage collector, synchronization) works. This allows the JIT to be tuned to work optimally with these components. The JPU can't do this. It simply isn't an option. Usually, the JPU has to remain somewhat generic because it needs to cater to different garbage collector and synchronization algorithms. If the VM changes the garbage collector algorithm, it is a lot easier to update the JIT to work with it. The JPU on, the other hand, will require new silicon if it was specialized to a specific algorithm. This is why it must remain generic. the Clincher The JIT on the other hand can apply optimization techniques like partial frame initialization, and most significantly inlining. Inlining is one of the most significant optimizations that a JIT can do. It accounts for a great portion of the speed-up that is attained by the JIT. Inlining effectively eliminates the need for frame setup and tear-down. In some cases, it eliminates the method lookup also. The number of instructions for method invocation overhead (method lookup, frame setup) can range from 25 instructions (with about 14 memory accesses) to 100s of instructions (and a lot more memory accesses). Note: these number are just educated guesses. I didn't actually count the instructions manually as I did with the earlier examples. However, I think they are in the correct ballpark. With inlining, the JIT can avoid all this cost (i.e. incur 0 machine cycles for invocations of hot methods). The JPU has no such option because it cannot inline Java code (i.e. incur 10s to 100s of machine cycles for each invocation). And this is the main reason, the JIT will tend to out-perform the JPU by significant margins when executing real world programs. Why use Hardware Execution? The last reason you may want to use a JPU is if you want to avoid compilation time which can be incurred by a JIT. This is a valid concern. However, in practice, people tend to blame slow startup on JIT compilation when, often times, it is actually due to many other factors. JIT compilation time usually contributes very little to the startup time. If I were to venture a guess as to how much, I would say less than 5% to 10% of startup on systems with very slow processors (though there can always be exceptions). If this is not tolerable, then a JPU may be worth considering. Final Thoughts And that concludes this entry. Till the next one, have a nice day. :-) Edit: I've fixed up some typos and provided some additional clarification above (thanks to feedback from a colleague). Here are some additional thoughts from my colleage that I think are worthwhile mentioning here:
Regarding the JPU mirroring the top most operands of the stack, to be effective, the JPU will need to prevent eager loading and spilling (between mirror registers and memory). Eager loading/spilling effectively gives the same performance as not having any register mirroring at all. Instead, loading/spilling should be done lazily (i.e. on demand when needed). At worst, this yields the same performance as no mirroring. At best, memory access can be avoided assuming the operand stack usage remains within the capacity of the mirror registers. Of course, all this needs to be handled automatically by the JPU hardware. Again, the JIT has an advantage here as it can choose more intelligently what memory content to mirror in registers. Regarding other JIT optimizations, yes I neglected to mention them. The inlining I did mention is definitely one of the biggest, if not the biggest. Others that the JPU doesn't have the option to apply includes: constant folding, loop unrolling, loop invariant hoisting, common subexpression elimination, use of intrinsic methods, etc. I'm not saying that all these optimizations are implemented in all JITs. However, JITs do have the option to implement them, but JPUs do not. Bookmark blog post: CommentsComments are listed in date ascending order (oldest first) | Post Comment
| ||||||||||||||||
|
|