
When is Software faster than Hardware?

Posted by mlam on February 13, 2007 at 2:44 AM PST

I decided that I'll take a break from the bug fix track that I've been on, and have a little diversion to spice things up. I'll resume the bug fix (and JIT internals) discussion soon. For today, I would like to clarify a common misconception that hardware Java processors are faster than dynamic adaptive compilers / just-in-time compilers (i.e. JITs). I'll take you through some analysis to prove my point. The analysis will be based on examples from the phoneME Advanced VM for CDC (aka CVM), but this reasoning should apply to other VMs as well. Let's dive in ...

Hardware Acceleration

Hardware acceleration is a technique that is commonly employed to get better software performance in terms of speed. This approach has been successful with graphics, sound, and DSP processing. In those cases, the hardware acceleration offloads the graphics, sound, and DSP work onto co-processors and frees up the main CPU to do other stuff. This parallelism is one reason we get improved performance out of hardware accelerators.

Another reason is that hardware accelerators can provide special instructions that do work traditionally done by software routines. Of course, these special instructions are specific to the types of algorithms (i.e. graphics, sound, DSP) that use them. Hence, if your application doesn't do much graphics, sound, and/or DSP work, then such hardware accelerators won't be able to make your application run any faster.

Because these hardware accelerators have been so successful in their respective domains, we have come to generalize that success and assume that all hardware acceleration will beat software solutions. In the case of Java processors in comparison to JITs, this generalization turns out to be untrue.

Java Processors

The Java VM specification comes with its own instruction set. Some of these instructions look a lot like those one would find in a typical CPU's instruction set. Hence, the idea is that by adding the Java VM's instruction set to a CPU's instruction decoder, one can improve the performance of Java code execution. This observation is valid. However, the misconception is that this hardware acceleration will also out-perform or even match VM JITs. In the case of modern JIT compilers, a hardware Java processor will be hard-pressed to beat the performance of a JIT. Note: In the following discussion, I will abbreviate the hardware Java processor simply as JPU for brevity.

Disclaimer: I am not commenting on the quality of any specific hardware Java processor implementations in the market, but merely looking at this issue from a purely theoretical viewpoint.

OK, now let's look at a specific example ...

a Simple Example

Here is a simple code fragment written in the Java programming language:

    a = a + b;  // Add 2 integers.

When compiled into Java bytecode, this fragment will look like this:

    iload_0     // load a onto Java stack.
    iload_1     // load b onto Java stack.
    iadd        // pop the values, add them,
                // and push the result back
                // onto the stack.
    istore_0    // store result into a.

Note: I am applying some simplifying assumptions in this case. These assumptions are that the integer variables a and b reside in the Java stack frame's local 0 and 1 slots respectively. I chose this simplification for 2 reasons:

  1. To keep the example from becoming too complex.
  2. To favor the hardware Java processor. Basically, the above kind of code is where JPUs perform the best.
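For concreteness, here is one hypothetical method shape under which that assumption holds: in a static method, the first two int parameters occupy local slots 0 and 1, so the statement compiles to exactly the four bytecodes shown above.

    // A hypothetical method in which the simplifying assumption holds: in a
    // static method, the first two int parameters occupy local slots 0 and 1,
    // so javac compiles the statement to exactly the four bytecodes above.
    static int add(int a, int b) {  // a -> local 0, b -> local 1
        a = a + b;                  // iload_0, iload_1, iadd, istore_0
        return a;                   // iload_0, ireturn
    }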

the VM Interpreter

Here's what the VM interpreter does to execute the above code.

    // opc_iload_0:
    ldr   r4, [fp, #-140]       // get address of start of locals
    ldrb  r3, [r8, #1]!         // compute next pc part 1
    ldr   r2, [r4, #0]          // load local 0
    ldr   r1, [r9, r3, asl #2]  // compute next pc part 2
    str   r2, [rJSP], #4        // push value on stack
    mov   pc, r1                // branch to next pc
    ...

    // opc_iadd:
    sub   r4, rJSP, #8          // Compute stack location of args
    ldmia r4, {r2, r4}          // load args i.e. loads 2 words from memory
    add   r2, r2, r4            // do add
    ldr   r1, [r9, r3, asl #2]  // compute next pc
    str   r2, [rJSP, #-8]       // store result on stack
    sub   rJSP, rJSP, #4        // adjust stack pointer
    mov   pc, r1                // branch to next pc
    ...

    // opc_istore_0:
    ldrb  r3, [r8, #1]!         // compute next pc part 1
    ldr   r2, [rJSP, #-4]!      // pop value off stack
    ldr   r4, [fp, #-140]       // get address of start of locals
    ldr   r1, [r9, r3, asl #2]  // compute next pc part 2
    str   r2, [r4, #0]          // store into local 0
    mov   pc, r1                // branch to next pc
    ...

The above is obtained by compiling CVM's interpreter loop (executejava_standard.c) into ARM assembly using gcc. I picked ARM only as an example instruction set. My intent is to have a real world example that we can do some timing analysis on. However, I replaced some of the register names with symbolic equivalents to make the assembly code easier to understand. rJSP is the register that contains the Java stack pointer.

In the above, I did not show the interpreter fragment for the iload_1 bytecode because it is essentially identical to that of the iload_0 bytecode except for a different offset being applied to the frame pointer (fp) to compute r4.

In the above, I also indicated that some instructions are computing the "next pc". By "next pc", I mean the instruction address of the interpreter fragment that is needed to execute the next bytecode. Hence, the comment "branch to next pc" actually means "go execute next bytecode".
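To see where this dispatch overhead comes from, here is a minimal, hypothetical sketch of an interpreter loop written in the Java language itself. This is not CVM's actual loop (that one is written in C and uses a dispatch table, which is what the "compute next pc" instructions above index into); it is only meant to show that every bytecode, no matter how trivial, pays for an opcode fetch and a dispatch before any useful work happens.

    // A minimal, hypothetical sketch of a bytecode interpreter loop -- not
    // CVM's actual implementation. Every iteration pays the same overhead:
    // fetch the opcode, dispatch to its handler, then branch back to the top.
    static void interpret(byte[] code, int[] locals, int[] stack) {
        int pc = 0;   // bytecode program counter
        int sp = 0;   // Java operand stack pointer
        while (true) {
            int opcode = code[pc++] & 0xFF;                         // fetch next bytecode
            switch (opcode) {                                       // dispatch ("branch to next pc")
                case 0x1a: stack[sp++] = locals[0]; break;          // iload_0
                case 0x1b: stack[sp++] = locals[1]; break;          // iload_1
                case 0x60: sp--; stack[sp - 1] += stack[sp]; break; // iadd
                case 0x3b: locals[0] = stack[--sp]; break;          // istore_0
                case 0xb1: return;                                  // return
                default: throw new IllegalStateException("unhandled opcode " + opcode);
            }
        }
    }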

Therefore, in order to execute the a = a + b statement above using the VM interpreter:

Number of instructions:

6 (iload_0) + 6 (iload_1) + 7 (iadd) + 6 (istore_0) = 25 instructions

Number of memory accesses:

5 (iload_0) + 5 (iload_1) + 4 (iadd) + 5 (istore_0) = 19 memory accesses

For memory accesses, I only counted the load and store instructions, and discounted the branches. The ldmia counts as 2 loads because it loads 2 words of memory.

Based on the StrongARM 1110 processor, under optimal conditions where all accessed memory is already in cache, each of the instructions will take at least one machine cycle (if I remember correctly). If there is a cache miss, a memory access can take up to 4 machine cycles, assuming no virtual memory paging stalls; with paging stalls, it will take a lot more time. Note: the ldmia instruction takes 2 cycles because it counts as 2 loads. So, under ideal conditions,

Number of machine cycles (no cache miss):

1 + 5*1 (iload_0) + 1 + 5*1 (iload_1) + 4 + 4*1 (iadd) + 1 + 5*1 (istore_0) = 26 machine cycles

For contrast, let's say all the memory accesses are taking 4 machine cycles to complete. In this less ideal case,

Number of machine cycles (with cache miss):

1 + 5*4 (iload_0) + 1 + 5*4 (iload_1) + 4 + 4*4 (iadd) + 1 + 5*4 (istore_0) = 83 machine cycles

the Hardware Java Processor

For a hardware processor, the instructions being executed are the Java bytecodes themselves. However, the number of machine cycles to execute them can be estimated based on the interpreter.

For a JPU, there is no interpreter loop. Hence, there is no "next pc" to compute. The next pc is the next bytecode instruction which is pointed to by the JPU program counter register.

I'm also assuming (to the JPU's advantage) that it has a dedicated register which points to the start of the current Java stack frame's locals. With this, the work to get the address of the locals can be eliminated.

What remains are the essential operations that a JPU must still execute (even if behind the scenes). In the interpreter fragments above, these are the loads and stores that move values between the locals and the Java operand stack, plus the add itself; the bytecode dispatch, the locals-base load, and the address and stack-pointer arithmetic are assumed to be handled implicitly by the hardware.
The numbers for the JPU are:

Number of "ARM instructions":

2 (iload_0) + 2 (iload_1) + 3 (iadd) + 2 (istore_0) = 9 instructions

Number of memory accesses:

2 (iload_0) + 2 (iload_1) + 3 (iadd) + 2 (istore_0) = 9 memory accesses

Number of machine cycles (no cache miss):

2*1 (iload_0) + 2*1 (iload_1) + 1 + 3*1 (iadd) + 2*1 (istore_0) = 10 machine cycles

Number of machine cycles (with cache miss):

2*4 (iload_0) + 2*4 (iload_1) + 1 + 3*4 (iadd) + 2*4 (istore_0) = 37 machine cycles

Note: the JPU doesn't translate the bytecode into equivalent ARM instructions before executing them. But internally, the JPU still needs to do some equivalent quantity of work. I'm using the ARM instruction as a realistic approximation of the work that is done by the JPU in hardware.

Note also that the ldmia instruction is effectively equivalent to 2 load instructions. Please take this into account if you want to check my math above.

the JIT

For a JIT, the a = a + b statement will compile into the following ARM instructions:

    ldr   r7, [rJFP, #-140]   // load local 0
    ldr   r8, [rJFP, #-136]   // load local 1
    add   r7, r7, r8          // do add
    str   r7, [rJFP, #-140]   // store result into local 0

rJFP is a register dedicated by the JIT to contain the Java frame pointer.

In practice, the last store instruction may be eliminated depending on what bytecodes come after it. Also, sometimes the values of locals 0 and 1 may already be in registers, depending on what bytecodes precede this code fragment. In that case, the 2 load instructions may be eliminated as well. Therefore, under ideal JIT conditions, the a = a + b statement gets compiled by the JIT into a single ARM add instruction. But to be conservative in favor of the JPU case, I'll include the loads and stores for our comparison. Hence, the JIT numbers are:

Number of instructions: 4 instructions

Number of memory accesses: 3 memory accesses

Number of machine cycles (no cache miss): 1 + 3*1 = 4 machine cycles

Number of machine cycles (with cache miss): 1 + 3*4 = 13 machine cycles

For the ideal JIT case where we get to compile the statement into only a single ARM add instruction:

Number of instructions: 1 instruction

Number of memory accesses: 0 memory access

Number of machine cycles (no cache miss): 1 machine cycle

Number of machine cycles (with cache miss): 1 machine cycle

Comparing the Numbers

Here are the timing numbers next to each other:

    Machine Cycles     Interpreter    JPU           JIT           ideal JIT
    no cache miss      26             10 (2.60x)    4 (6.50x)     1 (26.0x)
    with cache miss    83             37 (2.24x)    13 (6.38x)    1 (83.0x)

The x numbers in parentheses indicate the speed-up factor over interpreter speed. Based on this trivial example, the JIT out-performs the JPU by a good margin. The reason is that the JIT is able to eliminate some or all of the memory accesses by keeping the working data in registers. In contrast, the JPU is constrained by the VM specification's stack machine architecture, which incurs a lot of memory accesses.

But before you go crazy over these numbers, let me inject some sanity. In real life, you will probably not see the theoretical 26x or 83x speed-up for the ideal JIT case above. This is because the JIT won't always be able to keep all the working data in registers; there is only a finite set of registers. This limitation is called register pressure. Keeping everything in registers is possible sometimes, but not in all cases. Hence, the real world numbers for the JIT are a lot closer to the non-ideal case than the ideal case.
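As a rough, hypothetical illustration of register pressure, consider a method where many values are live at the same time. On a 32-bit ARM target, only around a dozen registers are available for allocation (the pc, sp, lr, and the VM-reserved registers like rJSP are off-limits), so the JIT has to spill some of these values back to the stack frame in memory.

    // A hypothetical example of register pressure: 18 values (a..l and s1..s6)
    // are live at the same time, more than the roughly dozen ARM registers
    // available to the JIT, so some of them must be spilled to memory.
    static int manyLiveValues(int a, int b, int c, int d, int e, int f,
                              int g, int h, int i, int j, int k, int l) {
        int s1 = a * b, s2 = c * d, s3 = e * f;
        int s4 = g * h, s5 = i * j, s6 = k * l;
        // All of a..l and s1..s6 are still needed below.
        return (s1 + s2 + s3 + s4 + s5 + s6)
             * (a + b + c + d + e + f + g + h + i + j + k + l);
    }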

To be fair to the JPU, to get around the stack architecture problem, a JPU would probably implement a stack caching scheme where some registers mirror the top N values on the Java operand stack. It will also have some registers mirror some of the locals. This effectively allows the JPU to avoid memory accesses just like the ideal JIT case, thereby allowing it to theoretically attain speeds approaching the ideal JIT case. However, register pressure limits the JPU just as it limits the JIT.

In addition, the JPU can only mirror the top most operands on the stack, and maybe the first few locals. Which stack operands and/or locals get mirrored in registers usually isn't configurable at runtime. The JIT, however, can deal with this dynamically, and can arbitrarily choose which arguments and/or locals to keep in registers. Hence, the JIT has a higher probability of reducing memory accesses than the JPU.

Complex Bytecodes

But wait, that isn't the whole story. I said earlier that the simple example above was chosen because it shows off the JPU's best performance. Note that even when the JPU is at the top of its game, the JIT still tends to beat it. Now consider the complex bytecodes in the VM instruction set.

The VM instruction set also includes bytecodes like new (for allocating objects), newarray (for allocating arrays), monitorenter (for locking an object), monitorexit (for unlocking an object), field accessors like getfield, getstatic, putfield, and putstatic, and many other complex bytecodes.

I call these bytecodes "complex" because their execution cannot be done entirely in hardware. For example, new needs to work correctly in harmony with the VM's garbage collector; monitorenter needs to work in harmony with the VM's synchronization and threading mechanisms, etc. In the case of the field accessors, these bytecodes have to interface with the class's constant pool, which can be implemented differently depending on the software VM implementation. The constant pool entries will also need to be resolved (amongst other things) before the fields can be accessed.
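To make this concrete, here is a hypothetical snippet showing where these bytecodes come from in ordinary Java source; the bytecodes each line compiles to are noted in the comments.

    // Hypothetical example: the "complex" bytecodes discussed above, as
    // produced by javac for everyday Java source.
    class Point {
        int x;
    }

    class ComplexBytecodes {
        static int counter;
        static final Object lock = new Object();

        static int demo(Point p) {
            Point q = new Point();     // new (plus invokespecial for the constructor)
            int[] buf = new int[4];    // newarray
            counter = counter + 1;     // getstatic + putstatic
            synchronized (lock) {      // getstatic, then monitorenter / monitorexit
                q.x = p.x;             // getfield + putfield
            }
            return q.x + buf.length;   // getfield (+ arraylength)
        }
    }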

The only way the JPU can support these bytecodes is to trap / call out to software handler routines. Note that in a Java program, the use of these bytecodes is actually quite common. This means that when executing real world programs, the JPU won't be able to realize its optimal speed-up potential. Real world programs don't just sit in tight loops doing simple arithmetic and logic operations (which is when the JPU gets to perform its best).

Some JPUs will have support for quickened bytecodes that may accelerate the field accessors once all the difficult stuff has been done by software. However, there are some issues with how effective hardware quickened bytecodes can be.

But let's assume that the JPU has all the ideal parts that can allow it to get close to the JIT's performance. After all, the JIT also needs to do allocation and synchronization in software the same way the JPU does, right? Technically, yes. However, the JIT may have a slight advantage in that it can make use of special knowledge about how the VM components (e.g. the garbage collector and synchronization) work. This allows the JIT to be tuned to work optimally with these components. The JPU can't do this. It simply isn't an option. Usually, the JPU has to remain somewhat generic because it needs to cater to different garbage collector and synchronization algorithms. If the VM changes the garbage collector algorithm, it is a lot easier to update the JIT to work with it. The JPU, on the other hand, would require new silicon if it were specialized to a specific algorithm. This is why it must remain generic.

the Clincher

Another set of popular bytecodes used in Java code is the invoke bytecodes used for method invocation. The JPU must execute these bytecodes by trapping to software handlers. The software handlers will do work like looking up the method to dispatch to (think virtual invokes), and pushing a frame for the method. For synchronized methods, the handlers need to do synchronization as well. Hence, for invocations, the JPU basically executes at the speed of the interpreter.

The JIT, on the other hand, can apply optimization techniques like partial frame initialization and, most significantly, inlining. Inlining is one of the most effective optimizations a JIT can do. It accounts for a great portion of the speed-up attained by the JIT. Inlining effectively eliminates the need for frame setup and tear-down. In some cases, it eliminates the method lookup as well.
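As a hypothetical illustration of what inlining buys, consider a hot loop that calls a tiny getter. Interpreted or on a JPU, every iteration pays the full invokevirtual cost (method lookup plus frame setup and tear-down); a JIT that inlines the getter reduces the call to little more than a field load.

    // A hypothetical illustration of inlining; the names are made up.
    class Box {
        private int width;
        Box(int width) { this.width = width; }
        int getWidth() { return width; }   // tiny, hot method: a typical inlining candidate
    }

    class Caller {
        static int totalWidth(Box[] boxes) {
            int total = 0;
            for (Box b : boxes) {
                // Without inlining: invokevirtual overhead on every iteration.
                // With inlining: the call collapses to a getfield + iadd.
                total += b.getWidth();
            }
            return total;
        }
    }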

The number of instructions for method invocation overhead (method lookup, frame setup) can range from 25 instructions (with about 14 memory accesses) to 100s of instructions (and a lot more memory accesses). Note: these numbers are just educated guesses. I didn't actually count the instructions manually as I did with the earlier examples. However, I think they are in the correct ballpark.

With inlining, the JIT can avoid all this cost (i.e. incur 0 machine cycles for invocations of hot methods). The JPU has no such option because it cannot inline Java code (i.e. it incurs 10s to 100s of machine cycles for each invocation). And this is the main reason the JIT will tend to out-perform the JPU by significant margins when executing real world programs.

Why use Hardware Execution?

Regardless, the JPU does have value. While it cannot compete with a JIT in terms of performance, it will perform much better than a software interpreter. Another reason to use a JPU is for extremely memory-constrained environments where you have absolutely no memory budget to spare. The JIT adds 10s to 100s of kilobytes of memory consumption. If this is not a cost that can be tolerated, a JPU may be the best solution for performance. Note also that the JPU executes Java bytecode, which is extremely compact. This means that the code footprint is also smaller.

The last reason you may want to use a JPU is to avoid the compilation time incurred by a JIT. This is a valid concern. However, in practice, people tend to blame slow startup on JIT compilation when, often, it is actually due to many other factors. JIT compilation time usually contributes very little to the startup time. If I were to venture a guess as to how much, I would say less than 5% to 10% of startup on systems with very slow processors (though there can always be exceptions). If this is not tolerable, then a JPU may be worth considering.

Final Thoughts

There's always the possibility that someone has invented a JPU with some advanced technology that I can't even imagine. I certainly can't speak for those cases. However, the above is realistic based on what I understand of common CPU technology and architectures today. Besides the JPU, other forms of hardware acceleration for Java VMs may include acceleration for memory accesses and lookups. These will probably yield greater returns than acceleration for bytecode execution.

And that concludes this entry. Till the next one, have a nice day. :-)


Edit: I've fixed up some typos and provided some additional clarification above (thanks to feedback from a colleague). Here are some additional thoughts from my colleague that I think are worthwhile mentioning here:

"the JPU can only mirror the top most operands on the stack," - yes, I wonder about this, it's got to load and spill them a lot so I wonder how much of a win this actually is.

And of course the JIT can do other fancy optimizations as well ...

Regarding the JPU mirroring the top most operands of the stack, to be effective, the JPU will need to prevent eager loading and spilling (between mirror registers and memory). Eager loading/spilling effectively gives the same performance as not having any register mirroring at all. Instead, loading/spilling should be done lazily (i.e. on demand when needed). At worst, this yields the same performance as no mirroring. At best, memory access can be avoided assuming the operand stack usage remains within the capacity of the mirror registers. Of course, all this needs to be handled automatically by the JPU hardware. Again, the JIT has an advantage here as it can choose more intelligently what memory content to mirror in registers.

Regarding other JIT optimizations, yes, I neglected to mention them. The inlining I did mention is definitely one of the biggest, if not the biggest. Others that the JPU doesn't have the option to apply include: constant folding, loop unrolling, loop invariant hoisting, common subexpression elimination, use of intrinsic methods, etc. I'm not saying that all these optimizations are implemented in all JITs. However, JITs do have the option to implement them, but JPUs do not.
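For the curious, here are hypothetical source-level sketches of two of those optimizations. A JIT performs them on compiled code rather than on source, but the effect is the same.

    // Hypothetical sketches of constant folding and loop invariant hoisting.
    class OtherOptimizations {
        static int constantFolding() {
            // javac already folds this literal expression; a JIT additionally
            // folds constants that only become visible after inlining and
            // other transformations.
            return 60 * 60 * 24;            // becomes the single constant 86400
        }

        static int loopInvariantHoisting(int[] data, int m) {
            int sum = 0;
            for (int i = 0; i < data.length; i++) {
                sum += data[i] * (m * m);   // (m * m) is loop-invariant: the JIT can
            }                               // hoist it out and compute it just once
            return sum;
        }
    }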
