JIT Performance: Defying Physics?
A few days ago, I came across a few blog entries that referenced my
article. They are:
software faster than hardware? by Matthew Schmidt, and
JIT'ed Code be Faster than Hardware Accelleration by Kirk Pepperdine.
These blog entries had received some comments that I thought deserved a response. So below, I will try to address issues raised in some of those comments, as well as provide an intuitive understanding of why you would expect a JIT to outperform a JPU (a Java processor).
is Software faster than Hardware?, Software
Territory: Where Hardware can't go!
Let's start with ...
How can software possibly run faster than hardware? I've made this
statement myself numerous times in the past. What was I thinking when I made that statement? Well, basically, the thought goes: if a piece of software is running on some given hardware, the performance of that software is ultimately gated by the hardware it runs on. This can be illustrated with an analogy as follows ...
A given car can travel at a top speed of 60 miles per hour. No matter who (a new student driver or a professional race car driver) gets into the driver seat, this car is not going to go any faster than 60 mph.
True. But, let's consider an alternative perspective as illustrated by this analogy ...
I need to get from this side of town to that side of town. Because of the layout of the town, the trip will take us through a lot of twists and turns: the main road from start to destination is not straight. How can I shorten the amount of time needed to make this trip?
In this case, our concern is not about what the top speed of the car is. What we really care about is how we can get to our destination quickly. There are 3 ways to solve this problem:
- Get a faster vehicle.
- Get a specialized vehicle that can do twists and turns quickly.
- Chart a different route with shortcuts.
Approach 1: Upgrade to a Ferrari
If we increase the top speed of the vehicle, we will be able to run each straight leg in the route a little quicker, thereby reducing the overall time to get to the destination.
Approach 2: Use a Special Vehicle
Since we have to do a lot of corners (with those twists and turns), if we use a special custom built vehicle that does corners much faster than the old car we've got, then we can traverse the route without having to slow down much, and reduce the overall time to get to the destination.
Approach 3: Change the Route
Forget the main road. If we take shortcuts through local roads, we can shorten the effective distance that needs to be travelled. As a result, the overall trip time is reduced.
Cars and Computers
What has any of this got to do with Java bytecode execution? Well, these 3 approaches above represent the following solutions to getting better performance:
Solution 1: Use a faster CPU
A faster CPU (and supporting hardware system) executes everything faster. Obviously, that should reduce the amount of time needed to execute some sequence of bytecodes.
Solution 2: Use a Java processor
A JPU can execute the bytecodes (at least quite a few of them) natively. Hence, it incurs less overhead to execute those bytecodes and reduces the overall execution time.
Solution 3: Use an optimizing JIT
A JIT transforms the bytecodes into a form that can be executed quickly by the CPU. The JIT does not naively translate the bytecodes one at a time; it also restructures them to reduce the amount of work that needs to be done. As a result of these code transformations, while the end result is the same, the actual path taken to execute the bytecodes is different and a lot shorter. Hence, the overall execution time is reduced.
But why should the JIT's approach be faster than the first 2? When we're talking about a faster CPU (solution 1), I think I don't need to say much. Everyone knows that you can run an interpreted VM faster if you get faster hardware. But if you deploy a JIT, you can get more bang out of your old hardware. Which comes out ahead in terms of speed is not always obvious. It depends on many factors.
As for the JPU vs the JIT, I'm assuming that the comparison is based on equivalent hardware features (with the sole exception of how the Java bytecodes are executed). For example, same amount of cache, same memory speed, same CPU clock speed, same pipelining architecture, same superscalar / scalar architecture, etc.
In the car analogy, this means that the JPU approach is effectively using the same 60mph car engine. The difference is in the other car components that allow the JPU to "turn corners" quickly. The reason the JIT outperforms the JPU is that it takes the approach of changing the route taken. It doesn't need to be fast at doing corners. It simply takes the corners out of the route by taking local streets. In some cases, the JIT may even build a new highway between the 2 traversal points, thereby allowing the journey to be made in a straight line. As you know, the shortest distance between 2 points is a straight line. As fast as a JPU can be at doing those corners, the sheer distance that needs to be travelled along the original route is just too long compared to the straight line that the JIT takes.
False Analogy or Reality?
The next thing you'll probably ask me is whether this highway analogy is truly representative of what actually happens in how a JIT gets performance, or is it just a pretty story not grounded in reality? The answer is that it is a fairly accurate analogy.
The Java VM specification defines the VM as a stack-based execution engine. The stack is used to pass operands between bytecodes, as well as to pass arguments / return values to / from methods. The stack is the "main road" that is taken for routing data. The JPU approach tries to execute bytecodes as its own native instructions. Hence, the JPU must also necessarily use the stack abstraction as the path for routing data.
The JIT is not confined by this stack architecture. It transforms the stack-centric bytecode into randomly-accessed memory and register-centric CPU machine code. All the redundant operations (the twists and turns) associated with pushing and popping stack operands can be eliminated. This reduces unnecessary memory access overhead.
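To make the stack "main road" a little more concrete, here is a minimal sketch in Java. It is only a toy illustration, not how a real JPU or JIT is built: the class name StackRouteSketch and its methods are hypothetical, and the boxed Deque stands in for a hardware operand stack. It assumes the statement c = a + b compiles to a bytecode sequence along the lines of iload, iload, iadd, istore, and it counts the pushes and pops along that route.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only: a toy model of operand stack traffic,
// not a real Java VM, JPU, or JIT.
final class StackRouteSketch {
    // Counts every push and pop so the stack traffic becomes visible.
    static int stackOps = 0;

    // Stack route: mimics the bytecode sequence iload a; iload b; iadd; istore c.
    static int addViaStack(int a, int b) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(a); stackOps++;             // iload a
        stack.push(b); stackOps++;             // iload b
        int right = stack.pop(); stackOps++;   // iadd pops its two operands...
        int left = stack.pop(); stackOps++;
        stack.push(left + right); stackOps++;  // ...and pushes the sum
        int c = stack.pop(); stackOps++;       // istore c
        return c;
    }

    // Direct route: roughly what register-centric code amounts to,
    // i.e. a single add with both operands already in registers.
    static int addDirect(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        int viaStack = addViaStack(2, 3);
        System.out.println(viaStack + " via the stack, using " + stackOps + " stack operations");
        System.out.println(addDirect(2, 3) + " directly, using no stack operations");
    }
}
```

Both routes arrive at the same sum. The difference is the number of pushes and pops made along the way, which is the kind of data motion that register-centric compiled code has no need for.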
Caching Stack Operands and Locals in Registers
Even for JPU implementations that cache the top N words of the operand stack and some local variables in registers, the stack-centric nature of the JPU requires that data be moved back and forth between these registers which mirror the locals and the stack. All this data motion is not necessary in JIT compiled code.
Instruction Pipelines and Superscalar Execution
What if you have a JPU with a deep instruction pipeline or a superscalar architecture that can effectively handle all that data motion between the stack and locals registers for free (i.e. 0 or near 0 cycles)? Assuming this is even possible, in this case, the JPU is only at best matching the performance of the JIT. Note that the JPU needed a lot of advanced hardware features in order to eliminate an overhead that a JIT naturally eliminates.
The JIT Super-Highway
Even given the above advanced hardware features in a JPU, the JIT still comes out ahead because of the other optimizations that a JIT can do. I mentioned that the JIT can build a highway so that the route is now a straight line. An example that demonstrates this best is the JIT optimization of inlining.
Inlining removes the twists and turns (overhead) that show up in method calls / returns (i.e. arguments / return value passing, frame pushing / popping, etc). The code content of the methods in a hot code path is laid out in a straight line for the CPU to execute. Method call overhead need not be incurred when execution crosses from a caller to a callee, and vice versa. This is the by-product of inlining. Hence, in essence, the inlining optimization is effectively analogous to creating a highway that allows the code to be executed in a straight line.
Inlining changes the route of code execution via code transformation. This is not possible for the JPU which needs to adhere to the original stack-centric route by definition.
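Here is a rough before-and-after picture of inlining, written by hand in Java. This is only a sketch under my own assumptions: the class InliningSketch and its accessors are hypothetical, and a real JIT performs this transformation on its internal representation of the code, not on Java source. It shows how the callee bodies end up laid out inside the caller.

```java
// Hypothetical illustration: a hand-written "before" and "after" of inlining.
final class InliningSketch {
    private final int[] values;

    InliningSketch(int[] values) {
        this.values = values;
    }

    // Small accessor methods of the kind that often sit on a hot code path.
    int size() {
        return values.length;
    }

    int get(int index) {
        return values[index];
    }

    // Original route: two method calls on every loop iteration.
    int sumWithCalls() {
        int sum = 0;
        for (int i = 0; i < size(); i++) {
            sum += get(i);
        }
        return sum;
    }

    // Roughly the shape of the code once size() and get() have been inlined:
    // the callee bodies sit in the caller, so there is no argument passing,
    // frame pushing / popping, or return on the way through the loop.
    int sumInlined() {
        int sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        InliningSketch sketch = new InliningSketch(new int[] {1, 2, 3, 4});
        System.out.println("with calls: " + sketch.sumWithCalls());
        System.out.println("inlined:    " + sketch.sumInlined());
    }
}
```

The two methods return the same sum. Only the route taken to compute it differs.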
Micro / Macro Perspectives
But why can't hardware optimizations employ the same approach of finding shortcuts and creating bypass highways that the JIT does? Actually, they can and do in their own way. But the limitation lies in the scope to which the optimizations can be applied. Hardware's perspective is limited to making one or a few instructions execute faster. A JIT gets to optimize at a much higher level (across multiple methods), and therefore can see more opportunities for optimizations.
In compiler work, optimizations can be classified in categories according to scope. Peephole optimizations only focus on making a very small unit of code (a few instructions) run faster. Local optimizations generally describe a slightly broader range. For example, the scope may be limited to a basic block of code, or a method. Lastly, global optimizations can span multiple blocks or methods. The disadvantage of hardware optimizations is that they are usually only peephole optimizations, whereas a JIT is free to apply all levels of optimizations.
This isn't because it is theoretically impossible for hardware to implement higher level optimizations. Rather, doing so becomes prohibitively difficult and expensive. Hence, for practical purposes, it is impossible.
To get an idea of the power of higher level optimizations, consider the type of speed up you'll get by replacing a bubblesort algorithm with quicksort. A low level optimization done by hardware (like those in a JPU) focuses on something near the instruction level, like improving the compare instruction in the bubblesort loop. A high level optimization would be to replace the sort algorithm altogether with a quicksort. The quicksort algorithm achieves exactly the same results, but is faster because it takes an entirely different route that eliminates all the redundant work in a bubblesort.
Now, a JIT isn't quite so advanced to arbitrarily be able to rewrite your algorithms for you. It is still quite possible to write bad code that will yield bad performance even when run with a JIT. There is no substitute for good engineering (which I hope is why we all still have jobs). However, this sorting example does give you an idea of the benefits of higher level optimizations vs low level ones.
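If you'd like to see the sorting example in numbers, here is a small Java sketch that counts the comparisons each algorithm makes on the same input. It is an illustration only: the class SortRouteSketch, the input size of 1000, and the random seed are my own arbitrary choices, and the quicksort shown is just one simple textbook variant (middle-element pivot), not anything a JIT would generate.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration: counts comparisons to show how much work
// each "route" does to reach the same sorted result.
final class SortRouteSketch {
    private static long comparisons;

    // Bubble sort: compares every adjacent pair on every pass.
    // Sorts the array in place and returns the number of comparisons made.
    static long bubbleSort(int[] a) {
        comparisons = 0;
        for (int end = a.length - 1; end > 0; end--) {
            for (int i = 0; i < end; i++) {
                comparisons++;
                if (a[i] > a[i + 1]) {
                    swap(a, i, i + 1);
                }
            }
        }
        return comparisons;
    }

    // Quicksort: sorts the array in place and returns the number of comparisons made.
    static long quickSort(int[] a) {
        comparisons = 0;
        quickSort(a, 0, a.length - 1);
        return comparisons;
    }

    // Partitions around the middle element, then sorts each side recursively.
    private static void quickSort(int[] a, int lo, int hi) {
        if (lo >= hi) {
            return;
        }
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo;
        int j = hi;
        while (i <= j) {
            while (lessThan(a[i], pivot)) i++;
            while (lessThan(pivot, a[j])) j--;
            if (i <= j) {
                swap(a, i, j);
                i++;
                j--;
            }
        }
        quickSort(a, lo, j);
        quickSort(a, i, hi);
    }

    // Element comparison that also bumps the counter.
    private static boolean lessThan(int x, int y) {
        comparisons++;
        return x < y;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1000, 0, 1_000_000).toArray();
        int[] forBubble = data.clone();
        int[] forQuick = data.clone();
        long bubbleCount = bubbleSort(forBubble);
        long quickCount = quickSort(forQuick);
        System.out.println("same result: " + Arrays.equals(forBubble, forQuick));
        System.out.println("bubblesort comparisons: " + bubbleCount);
        System.out.println("quicksort comparisons:  " + quickCount);
    }
}
```

Making each individual compare a little faster helps both algorithms by the same small factor. Choosing the route with far fewer compares is where the large gain comes from.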
Hardware isn't slow! Software makes it slow! ... Or does it?
Lastly, there was one comment that asked if the problem was actually with the Java VM instruction set not being designed for processors, rather than the JPU hardware itself. In a sense, the gentleman was dead on correct. The Java VM specification defines a high level computing platform. It is not a CPU specification. This is one reason why it would be so difficult to make a JPU (which is a CPU trying to be a Java VM) that has high performance.
But before we go blaming software or the Java platform for performance problems, let's get some proper perspective. Here's an analogy to illustrate this:
A CPU is like a car engine. If it's not hooked up to anything, it can do a very high number of rotations per minute (RPM). Blinding speeds, and great power. However, this raw power doesn't produce any useful work that a human user would normally care about.
Next, we add assembly code. To the car engine, you hook up a minimal frame with a front wheel with a steering rod attached to it. To the engine itself, you attach some wheels. Now, you have something that can do useful work ... take a human from point A to point B. It can probably run very fast too (not much extra weight to carry around). This infrastructure attached to the engine is simple and minimalistic. It is also dangerous and error prone. Beware of it exploding in your face.
Stepping it up a notch, we have C code (i.e. high level assembly). This time, you have a box car frame, a gas pedal, a gas throttle, a gear box, ... mechanisms to help the driver better control the engine. The problem is that these mechanisms are not highly automated. It is still too easy for the human driver to make mistakes e.g. stepping on too much gas, being too rough on the gear shifting, etc. The vehicle is more user friendly but is still close to the metal. The driver gets a lot of low level control over how to divert the power of the engine. A very skilled driver can harness the power of the engine very efficiently. A less experienced driver (or even a very skilled driver on a bad day) can run the car into the ground, or slam it into a wall.
Further up the chain still, we have the Java platform. At this level, you get the cruise control, automatic gear shift, anti-lock brakes, fuel injection system, etc. With this, the driver can focus more on charting the trip he/she wants to make, and less on how to control the fuel system, gear shifting, and power efficiency of the car. The advanced infrastructure that comes with this car takes care of all that. This advanced infrastructure is the Java VM and its execution unit (a JIT or a JPU, or both).
If your focus of performance is on how many RPMs or maximum torque the engine can yield, then yes, you are correct about software slowing down the performance of the engine.
However, if your focus is on performance attained in doing useful work for a human being, then the problem of getting performance is not about increasing the engine RPM or maximum torque. It is more about reducing the inefficiencies that occur when you attach the surrounding infrastructure to your engine.
This is the type of performance that I have been talking about. If you are already a user of the Java platform, chances are you already know about all the benefits that it gives you (and the costs as well). I won't go into those here. What remains is how you can get greater performance for the work your code needs to do. The JIT and JPU are two infrastructure subsystems that harness the power of the underlying hardware. The JPU approach happens to be implemented in more hardware. However, both harness the power of the underlying computational facilities of a CPU of some sort.
In the case of the JPU, its efficiency is not very high because of its stack-centric nature, and its inability to optimize high level work that is commonly done by a Java VM. The JIT, on the other hand, is highly efficient because it is, by its very nature, designed to overcome the very limitations that hurt the JPU's performance. Hence, in this case, software (the JIT) is faster than hardware (the JPU).
And, no, the JIT isn't defying physics. It is just making better use of it.
If performance is your main criterion, a JPU is probably not your best option. However, there are many good reasons to use a JPU. If designed and implemented well, a JPU can yield benefits of a slightly smaller runtime code footprint, lower power consumption, and better performance than a software interpreter. For performance, hardware acceleration of other types can help significantly. Examples include advanced memory systems, media acceleration, hardware encoders/decoders, etc. These usually work in conjunction with JITs. But a JPU is not the best answer for high performance.
That said, if you are comparing a JPU with more advanced CPU features (e.g. higher clock speed, more internal cache, superscalar vs scalar, etc.) than your current general purpose CPU, then the JPU may be the better solution for you. This is, of course, assuming you can replace your board level hardware design, your memory budget is extremely tight, and the JPU costs less than a general purpose CPU with comparable hardware features. This makes sense in some cases. However, in most cases, the general purpose CPU will be the better choice because of other (non-performance related) reasons like greater availability of OSes (e.g. Linux), device drivers, supported third party software libraries, and also because of the need to support legacy code / applications which are usually written in C. If the Java VM must co-exist with these, the business case for the JPU would be less compelling. Anyway, that is beyond the scope of this discussion.
I hope that this article has made the performance issue easier to understand from an intuitive sense. Till the next one, have a nice day. :-)
Tags: phoneME Advanced, CVM, CDC, JIT, Java processor, software engineering, performance