Software Territory: Where Hardware can't go!

Posted by mlam on February 16, 2007 at 2:27 AM PST

In response to my previous article, some folks have been asking about the JIT optimizations I listed, as well as a lot of other interesting questions. I'm not sure I can address all of the questions here. But on the topic of JIT optimizations, I can provide more insight on what they are as well as why hardware cannot implement them.

Before I get started, just to be clear, I'm not personally against hardware Java processors. I certainly think that they fit nicely in some domains. I am also not against any vendors who make Java processors out there. I applaud them for serving the needs of a market that a JIT may not fit. Also, just because a JIT fits doesn't mean that it is always the best solution to deploy. In a previous article, I've made the case that engineering decisions should always be made on a case by case basis. A "one size fits all" mentality can work, but may not always yield the best solution.

However, I do want to debunk the myth that a hardware processor can be faster than an optimizing JIT. But, of course, the JIT isn't free. There is some cost to it in terms of CPU cycles and memory, though it is often a lot less than most people believe. I will address the JIT cost issue in a future article. For today, let's look at JIT optimizations. Since I work on the phoneME Advanced VM for CDC (aka CVM), I'll point out along the way whether these optimizations are available in CVM as it exists today (for those who are interested in CVM details).

Resources: When is Software faster than Hardware?

JIT Optimizations

In my last entry, I rattled off a random list of JIT compiler optimizations. The list is by no means comprehensive, nor necessarily indicative of the most desirable optimizations to have in a JIT. Previously, I have explained how more performance isn't always a good thing. Each optimization comes with a cost of some sort. The VM/JIT engineer must weigh the cost against the benefits in choosing to include or leave out an optimization. That said, let's go over the optimizations I've already mentioned as examples to illustrate why a JIT has the advantage over Java processors when performance is the criterion of comparison.

The list again is:

  1. inlining
  2. constant folding
  3. loop unrolling
  4. loop invariant hoisting
  5. common subexpression elimination
  6. use of intrinsic methods

Inlining

Consider this example:

    public class MyProperty {
        protected int value;
        public int getValue() {
            return value;
        }
    }

    public class User {
        public void doStuff(MyProperty p) {
            System.out.println(p.getValue());
        }
    }

This example shows a common coding pattern in the Java programming language: the use of getter/setter methods to access private data. This is done to achieve better encapsulation. We use getter methods like getValue() because accessing fields like value directly would introduce a whole slew of software engineering problems that I won't go into here.

While using a getter method is good for encapsulation, it is bad for performance because you have to incur the cost of a method call. The cost of a method call includes pushing arguments (e.g. the this pointer), setting up and tearing down a stack frame for the target method (getValue() in this case), and popping the return value off the stack. Inside the target method, there is also the added cost of more pushing and popping of operands and results. In this trivial example, we only need to push the result inside getValue(). In a more complex example, there can be other costs not shown here. This cost adds up to somewhere between 10s to 100+ machine instructions. Note that these instructions are all method overhead. The getValue() method still has to do the real work of accessing the field, which can take as little as 2 machine instructions.

To deal with this, when compiling doStuff(), a JIT compiler would inline the call to getValue() to effectively get the following code:

        public void doStuff(MyProperty p) {
            System.out.println(p.value);
        }

In so doing, you get the benefits of encapsulation (in the source code, at development time) for good software engineering practice, yet still get optimal performance (at runtime) as if you had accessed the field directly. All the cost of method invocation is removed. The access to p.value takes less than 10 instructions. Compare this with the extra 10s to 100+ instructions needed to do the method invocation.

Note that I said earlier that getValue() could do the work of accessing the field in as little as 2 instructions. But in the inlined case, I said it would take less than 10 instructions. Why the discrepancy?

Well, getValue() is a virtual method. Hence, there may be some added cost to check if we're actually going to end up invoking MyProperty.getValue() as opposed to an overriding method in a subclass. This is the reason for the 10 or so instruction estimate. However, in the case where this method is not overridden, the JIT can truly optimize this down to the minimum 2 instructions.
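
To make that a bit more concrete, here is a rough sketch, written as ordinary Java source for readability, of what a guarded inline of getValue() could conceptually look like. The getClass() test stands in for whatever receiver-type check the JIT actually emits in machine code; the names and the exact check are illustrative, not the actual CVM mechanism.

    public void doStuff(MyProperty p) {
        int v;
        if (p.getClass() == MyProperty.class) {
            // Fast path: the receiver is exactly a MyProperty, so the
            // inlined body is valid; just read the field directly.
            v = p.value;
        } else {
            // Slow path: a subclass may override getValue(), so fall
            // back to a real virtual method invocation.
            v = p.getValue();
        }
        System.out.println(v);
    }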

I pointed out the added complexity of dealing with virtual methods because I want you to understand that there is more to doing inlining correctly than meets the eye. There are many other details to the implementation of inlining that I can't go into here.

Hardware Method Invocation

Now, let's consider the Java processor (JPU). When executing doStuff(), the JPU will encounter an invokevirtual bytecode for the call to getValue(). By definition, the JPU will treat the invokevirtual bytecode as one of its machine instructions and execute it. However, the JPU won't know how a VM structures its stack. Hence, it will need to trap to software to do all the work that I pointed out above as overhead.

One might argue that a really advanced JPU will define the stack structure and the VM software will just have to conform to that so that hardware will know how to push and pop a frame itself. But even without the stack issues, there are a few other things that make it really hard for the JPU to do a method invocation purely in hardware.

For one, the invokevirtual bytecode specifies an index into the class constant pool (CP). The JPU will also need to be able to understand the structure of the CP. But the class constant pool has symbolic references to the method to be invoked. This will need to be resolved first. Resolution will trigger class lookup. In the case of invoking static methods, resolution can trigger classloading, class initialization, garbage collection, and exceptions being thrown. As you can see, invoking a method is not a trivial thing. It would take a seriously advanced and extremely complex JPU to do method invocations in hardware.

Note, you don't actually have to do classloading, garbage collection, etc. in hardware in order to do method invocations in hardware. You just need to be able to find some way to trap to these when the hardware can't handle it. If the JPU can just execute the common invocation cases in hardware (and leave the rest to software), then that's a big win. However, in order to achieve this, it will require that in addition to having to specify the stack structure, the JPU will at least also have to specify a constant pool structure that the hardware can understand.

Using Miraculous Hardware

Now, let's grant you that the hardware designer is relentless and gives you all that. With that, the JPU will still have to execute the method invocation which involves all the overhead I pointed out. Executing it in hardware doesn't mean that the overhead is gone. The work done in the overhead incurs a lot of memory accesses. What is the chance that you will never have a cache miss? And if you have a large enough cache to make cache misses improbable, then what would it take to be able to move multiple words of data (for the arguments, stack frame values, and result) around the cache without incurring multiple machine cycles? Chances are, the number of cycles incurred by the JPU will be non-zero. Now compare that with the JIT where that cost can be 0. There's no beating inlining when it comes to performance.

If you're still an optimist for the JPU, the next thing you may ask is if we can have the JPU do inlining too. But remember what I said about having to do a check in some cases when we're dealing with inlining virtual method calls (not to mention the other complexities that I did not talk about)? It will be a whole lot of extra work to be able to handle all those cases in hardware.

Yes, theoretically, anything one can do in software, you can also do in hardware. But doing it in hardware is significantly more difficult and costly (in terms of hardware design, manufacturing, etc.) than a software solution. So, a real world JPU would probably trap to software to do method invocation. At best, it can do something to help the software do less work, but it cannot reduce the work to 0 as a JIT can in this case.

Inlining is available in the CVM JIT.

Constant Folding

Consider this example:

    public class O1 {
        public static final int OFFSET = 5;
    }

    public class O2 {
        public static final int OFFSET = 3;
    }

    public class MyClass {
        int calcValue(int v1, int v2) {
            return (v1 + O1.OFFSET) + (v2 + O2.OFFSET);
        }
    }

The JIT can effectively compile calcValue() into:

        int calcValue(int v1, int v2) {
            return (v1 + v2 + 8);
        }

Constant folding is basically an optimization where we fold constants together to reduce the amount of work that needs to be done to compute a result. In this case, the JIT takes advantage of the algebraic properties of addition and pre-adds the 2 constants together instead of having to add them every time this method is called. Hence, only 2 add operations are needed when the method is called.

A JPU by definition will execute its instructions, which are the bytecodes. In this case, the bytecodes will include pushing 2 constants and doing 3 additions. With the possible hardware feature where the top N operands of the stack are mirrored in registers, the JPU can avoid some of the pushing and popping cost. However, it still needs to initialize the values of those registers. Compared to the JIT, the JPU will incur these additional register initialization costs plus one extra addition. The JIT can not only eliminate the add, but also encode the constant (in this case, the value 8), if it is not too big, into one of the add instructions. This allows it to avoid the register initialization altogether.

OK, you may ask: won't javac be smart enough to do the constant folding when the Java source code is compiled into bytecode? Maybe. I didn't check. In practice, constant folding usually becomes more meaningful when used in conjunction with inlining. Inlining may yield opportunities for constant folding that don't exist at the source level. For example:

    int adjustValue(int value) {
        return value + 5;
    }

    int adjustMore(int value) {
        return adjustValue(value) + 3;
    }

After inlining adjustValue() into adjustMore(), the JIT can also fold the constants as follows:

    int adjustMore(int value) {
        return value + 8;
    }

Some types of constant folding are available in the CVM JIT. In practice, constant folding has not yielded a lot of performance gains in real world benchmarks. Accordingly, we didn't put a lot of effort into implementing every possible type.

Loop Unrolling

Consider this example:

    int a = ... // some value.
    for (int i = 0; i < 3; i++) {
        a = a + i;
    }

The anatomy of the above loop includes the following operations:

  1. initialize the iterator i to 0.
  2. check to see if the iterator has exceeded the limit (i.e. 3).
  3. execute the addition within the loop.
  4. increment the iterator.
  5. branch back to the top of the loop.

Again, by definition, a JPU will execute the bytecode as its own native instruction set. Since the bytecode basically expresses the above operations, the JPU will execute steps 2 through 5 for each of the 3 iterations.

With loop unrolling, the JIT can compile the above code fragment into the following:

    int a = ... // some value.
    int i = 0;
    a = a + i;
    i++;
    a = a + i;
    i++;
    a = a + i;

With a little extra smarts, the JIT can further optimize the above to:

    int a = ... // some value.
    a = a + 0;  // i is 0.
    a = a + 1;  // i is 1.
    a = a + 2;  // i is 2.

Add constant folding:

    int a = ... // some value.
    a = a + 3;  // 0 + 1 + 2.

Loop unrolling, in and of itself, works to remove loop overhead like the branch back to the top of the loop, and possibly the iterator increment, as well as the limit check. But when combined with other optimizations, as we can see, the performance gains can be dramatic. It's not possible for the JPU to implement this optimization because, by contract, the JPU needs to execute the bytecodes as specified.

In practice, loop unrolling is not as trivial as the example shown above. Consider what happens if the loop iterator limit is a variable (as opposed to a constant) that is passed into the method. How many iterations do we unroll the loop into then? Alternatively, what if the limit is a very large constant? Unrolling all the way to the limit can result in some serious code bloat, which in turn reduces cache locality and can hurt performance. What if the code inside the loop can throw an exception, e.g. indexing into an array beyond its bounds? I won't go into the details of how a JIT deals with all these. I just want to point out that there is a lot more complexity to this optimization than is initially apparent.
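
For illustration only, here is one common shape such a transformation can take when the limit is a variable: partially unroll the loop by a fixed factor and handle the leftover iterations in a small remainder loop. This is written as plain Java source and is just a sketch of the general idea, not how any particular JIT (including CVM's) implements it.

    // Original loop: sum the first n elements of data.
    int sum(int[] data, int n) {
        int a = 0;
        for (int i = 0; i < n; i++) {
            a += data[i];
        }
        return a;
    }

    // Hypothetical partial unrolling by a factor of 4:
    int sumUnrolled(int[] data, int n) {
        int a = 0;
        int i = 0;
        for (; i + 3 < n; i += 4) {   // main loop: one branch per 4 additions
            a += data[i];
            a += data[i + 1];
            a += data[i + 2];
            a += data[i + 3];
        }
        for (; i < n; i++) {          // remainder loop: at most 3 iterations
            a += data[i];
        }
        return a;
    }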

Loop unrolling is not currently available in the CVM JIT. It is not easy to implement, and based on our previous experience, it is neither an important nor a cost-effective optimization for the CDC space. That's not to say that things won't change in the future.

Loop Invariant Hoisting

Consider this example:

    void foo(int[] data) {
        int a = ... // some value.
        for (int i=0; i < data.length; i++) {
            ...;
        }
    }

In the above example, the length of the array is fetched in every iteration of the loop. If the JIT can determine that data refers to the same array throughout the loop (the length of an array never changes once it is allocated), we can hoist the fetching of its length outside of the loop so that we don't incur the cost repeatedly for each iteration. The JIT effectively emits code that does the following:

    void foo(int[] data) {
        int a = ... // some value.
        // pre-fetch the array length into a register:
        int tempReg = data.length;
        for (int i=0; i < tempReg; i++) {
            ...;
        }
    }

This type of optimization is called loop invariant hoisting. In the JIT's case, fetching the array length requires accessing the array's data structure in memory (and memory accesses are expensive). Prefetching it into a register will allow the JIT to avoid this cost on every loop iteration. The JPU on the other hand has to execute the bytecode verbatim. As a result, it will fetch the array length on every loop iteration.

More advanced cases of loop invariant hoisting include interactions with inlining. Let's say the body of the loop invokes some method that gets inlined. If the method happens to perform some operation that is invariant, that operation can be hoisted out of the loop to avoid unnecessary redundant work. This is, of course, not possible for the JPU to implement because of the inlining issues.
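
Here is a hypothetical illustration of that interaction, again written as plain Java source with made-up method names. After scale() is inlined into sum(), the computation factor * 2 becomes visibly loop invariant and can be hoisted, along with the array length fetch.

    int scale(int factor) {
        return factor * 2;
    }

    int sum(int[] data, int factor) {
        int total = 0;
        for (int i = 0; i < data.length; i++) {
            total += data[i] * scale(factor);   // invariant call inside the loop
        }
        return total;
    }

    // After inlining scale() and hoisting the invariants, a JIT with these
    // optimizations could effectively execute:
    int sumHoisted(int[] data, int factor) {
        int total = 0;
        int s = factor * 2;          // hoisted: computed once
        int len = data.length;       // hoisted: fetched once
        for (int i = 0; i < len; i++) {
            total += data[i] * s;
        }
        return total;
    }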

Loop invariant hoisting is not currently available in the CVM JIT. It isn't easy to implement in a generic way. Again, it isn't the most important optimization to have for applications in the CDC space.

Common Subexpression Elimination

Consider this example:

    int a = p.value + p.value;

The bytecodes for the above include 2 fetches of the field value from the object p. Field accesses will result in memory accesses which can be expensive. The JIT recognizes that the above code can be expressed as follows:

    int tempReg = p.value;
    int a = tempReg + tempReg;

In this case, the fetching of the field is a subexpression of the addition expression. The JIT eliminated one subexpression by fetching the field only once and reusing its value as the second operand in the addition. In this case, it saves one memory access. This optimization is called common subexpression elimination (aka CSE). In contrast, a JPU will have to execute the bytecode verbatim and do the field access twice.

The above is only a very simple form of CSE. More complex forms exist, and those take a lot more effort to implement in the JIT. Some block-local types of CSE are available in CVM's JIT.
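
As a hypothetical taste of what a slightly richer CSE pass handles, consider an arithmetic subexpression that appears more than once within the same expression:

    int calc(int x, int y, int z) {
        return (x * y) + z * (x * y);   // (x * y) appears twice
    }

    // A JIT with CSE can effectively treat this as:
    int calcCSE(int x, int y, int z) {
        int t = x * y;                  // the common subexpression, computed once
        return t + z * t;
    }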

Intrinsic Methods

Consider the following example:

     ...
     time = System.currentTimeMillis();
     ...

The JPU will execute the above as a method invocation of a native method that gets the system's millisecond timer value.

Let's say we have a system where the millisecond timer is a 64-bit hardware timer/counter that is memory mapped. In other words, software can read it directly at some address in memory. A JIT can take advantage of this knowledge. Instead of emitting code that invokes the System.currentTimeMillis() method, it emits a single memory load from the location of the hardware timer. The gain here is that we need not incur all the method call overhead, as well as the other costs of invoking a native method (see Beware of the Natives). In other words, the JIT can reduce many hundreds of machine cycles down to a single 64-bit memory access.

This optimization is called intrinsifying the method, or using intrinsic methods. The idea is basically that there are certain standard library methods that the JIT knows the semantics/behavior of. This special knowledge allows the JIT to emit code that implements the semantics of the method without doing an actual method call, or alternatively, to do the method call in a less expensive manner.

Intrinsics are also one way that the JIT can make use of special hardware features instead of calling a software method. For example, Math.cos() can be replaced with a cos instruction if the hardware provides such a feature.
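
To illustrate the usage side (no special source code is required; the substitution happens entirely inside the JIT), here is an ordinary Java method that calls Math.sqrt(). Whether the call actually becomes a single hardware instruction depends on the JIT and on the target CPU having such an instruction; the example is hypothetical.

    // Plain Java source; nothing hints at the intrinsic.
    double length(double x, double y) {
        // A JIT that intrinsifies Math.sqrt() can compile this call down to a
        // single hardware square-root instruction (where the FPU provides one)
        // instead of performing a full method invocation.
        return Math.sqrt(x * x + y * y);
    }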

A JPU can't implement this optimization because it has to execute the invoke bytecode as specified. There's also the hurdle of needing to understand the VM's constant pool structure, and having to deal with the resolution, class initialization, etc. that I mentioned earlier. At the very least, a JPU cannot afford to implement as many intrinsics (in number and type) as a JIT.

Intrinsics are available in CVM's JIT.

Closing Thoughts

Again, theoretically it is possible to implement any software feature in hardware. However, the cost of doing so makes it impractical, and therefore effectively impossible.

Also, so far, I've been saying that a JPU can't implement all these optimizations because it has to execute the bytecodes verbatim. You might ask: why can't the JPU solution employ some sort of code transformation like the JIT so that it doesn't have to execute bytecode in a simple-minded way, i.e. verbatim? Well, if you do that, then what you have is a JIT. Code transformation is what a JIT does. It transforms bytecodes into a form that is optimal for the CPU to execute. Hence, by definition, a JPU (without a JIT) must execute the bytecode verbatim, and consequently, will not be able to implement JIT-type optimizations.

Another reminder: the above is only a sampling of possible JIT optimizations. This list is neither exhaustive nor representative of all the most important / cost-effective optimizations that a JIT can implement, though some of these are really important. Inlining is one that yields a lot of performance gain without too much cost when applied in a JIT.

Ok, time to stop. I hope this article helps shed some additional light on this topic. Have a nice day. :-)


BTW, regarding JavaOne, I will probably be there on one or more days. If folks are interested in getting together to have a little technical discussion, I'd be happy to oblige (assuming schedules will allow it).
