JIT Performance: Defying Physics?
Posted by mlam on February 21, 2007 at 5:56 PM PST
A few days ago, I came across a few blog entries that referenced my previous article. They are: When is software faster than hardware? by Matthew Schmidt, and Can JIT'ed Code be Faster than Hardware Accelleration by Kirk Pepperdine. These blog entries had received some comments that I thought deserved a response. So below, I will try to address issues raised in some of those comments, as well as provide an intuitive understanding of why you would expect a JIT to outperform a JPU.

Resources: When is Software faster than Hardware?, Software Territory: Where Hardware can't go!

Let's start with ...

Physics Shmeesics
True. But, let's consider an alternative perspective as illustrated by this analogy ...
In this case, our concern is not about what the top speed of the car is. What we really care about is how we can get to our destination quickly. There are 3 ways to solve this problem:
Approach 1: Upgrade to a Ferrari
Approach 2: Use a Special Vehicle
Approach 3: Change the Route

Cars and Computers

Solution 1: Use a faster CPU
Solution 2: Use a Java processor
Solution 3: Use an optimizing JIT

But why should the JIT's approach be faster than the first 2?

When we're talking about a faster CPU (solution 1), I don't think I need to say much. Everyone knows that you can run an interpreted VM faster if you get faster hardware. But if you deploy a JIT, you can get more bang out of your old hardware. Which comes out ahead in terms of speed is not always obvious. It depends on many factors.

As for the JPU vs the JIT, I'm assuming that the comparison is based on equivalent hardware features (with the sole exception of how the Java bytecodes are executed). For example, same amount of cache, same memory speed, same CPU clock speed, same pipelining architecture, same superscalar / scalar architecture, etc. In the car analogy, this means that the JPU approach is effectively using the same 60mph car engine. The difference is in the other car components that allow the JPU to "turn corners" quickly.

The reason the JIT outperforms the JPU is that it takes the approach of changing the route taken. It doesn't need to be fast at doing corners. It simply takes the corners out of the route by taking local streets. In some cases, the JIT may even build a new highway between the 2 traversal points, thereby allowing the journey to be made in a straight line. As you know, the shortest distance between 2 points is a straight line. As fast as a JPU can be at doing those corners, the sheer distance that needs to be travelled along the original route is just too long compared to the straight line that the JIT takes.

False Analogy or Reality?

The Java VM specification defines the VM as a stack-based execution engine. The stack is used to pass operands between bytecodes, as well as to pass arguments / return values to / from methods.
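To make the stack route a little more concrete, here is a minimal sketch of what routing operands through a stack costs for a single addition. It is only a toy: the class name `StackRouteDemo`, the counter `stackOps`, and both methods are made up for this illustration, and the operand stack is modelled with an ordinary `ArrayDeque` rather than anything a real VM or JPU would use.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only: a toy operand stack, not real JVM or JPU internals.
class StackRouteDemo {
    // Counts every push and pop on the toy operand stack (the "corners" on the route).
    static int stackOps = 0;

    // Evaluates a + b the way the bytecode sequence iload, iload, iadd, ireturn
    // describes it: every operand travels through the operand stack.
    static int addViaStack(int a, int b) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(a); stackOps++;            // iload a
        stack.push(b); stackOps++;            // iload b
        int rhs = stack.pop(); stackOps++;    // iadd pops its two operands ...
        int lhs = stack.pop(); stackOps++;
        stack.push(lhs + rhs); stackOps++;    // ... and pushes the sum
        int result = stack.pop(); stackOps++; // ireturn pops the return value
        return result;
    }

    // Roughly what compiled code can do instead: the operands stay in
    // registers and one add produces the result, with no stack traffic.
    static int addViaRegisters(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        int viaStack = addViaStack(2, 3);
        System.out.println(viaStack + " via the stack, using " + stackOps + " stack operations");
        System.out.println(addViaRegisters(2, 3) + " via registers, using no stack operations");
    }
}
```

Both methods compute the same sum. The only difference is how many times the operands are moved on the way there, which is the overhead the rest of this section talks about.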
The stack is the "main road" that is taken for routing data. The JPU approach tries to execute bytecodes as its own native instructions. Hence, the JPU must also necessarily use the stack abstraction as the path for routing data.

The JIT is not confined by this stack architecture. It transforms the stack-centric bytecode into CPU machine code that works on registers and randomly accessed memory. All the redundant operations (the twists and turns) associated with pushing and popping stack operands can be eliminated. This reduces unnecessary memory access overhead.

Caching Stack Operands and Locals in Registers

Instruction Pipelines and Superscalar Execution

The JIT Super-Highway

Inlining removes the twists and turns (overhead) that show up in method calls / returns (i.e. argument / return value passing, frame pushing / popping, etc). The code content of the methods in a hot code path is laid out in a straight line for the CPU to execute. Method call overhead need not be incurred when execution crosses from a caller to a callee, and vice versa. This is the by-product of inlining. Hence, in essence, the inlining optimization is analogous to creating a highway that allows the code to be executed in a straight line. Inlining changes the route of code execution via code transformation. This is not possible for the JPU, which needs to adhere to the original stack-centric route by definition.

Micro / Macro Perspectives

In compiler work, optimizations can be classified in categories according to scope. Peephole optimizations only focus on making a very small unit of code (a few instructions) run faster. Local optimizations generally describe a slightly broader range. For example, the scope may be limited to a basic block of code, or a method. Lastly, global optimizations can span multiple blocks or methods. The disadvantage of hardware optimizations is that they are usually only peephole optimizations, whereas a JIT is free to apply all levels of optimizations.
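Inlining is a handy example of an optimization whose scope spans more than one method. The sketch below shows, by hand, the kind of transformation meant here. The class `InliningDemo` and its methods are hypothetical names for this illustration, and the "after" version is written manually. A real JIT makes this decision itself on hot code, and this sketch does not claim to show what any particular JIT emits.

```java
// Hypothetical illustration only: hand-written "before" and "after" forms of inlining.
class InliningDemo {
    // A small callee that sits on a hot path.
    static int square(int x) {
        return x * x;
    }

    // Original route: every loop iteration pays for a call and a return
    // (argument passing, frame push / pop) just to do one multiply.
    static int sumOfSquares(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += square(i);
        }
        return total;
    }

    // Straight-line route: the callee's body is placed directly in the loop,
    // so the call / return overhead is gone and the result is unchanged.
    static int sumOfSquaresInlined(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += i * i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10));
        System.out.println(sumOfSquaresInlined(10));
    }
}
```

The two methods return the same value for any `n`. What changes is the route the CPU takes to get there, which is something an instruction-level optimization cannot rearrange.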
This isn't because it is theoretically impossible for hardware to implement higher level optimizations. However, doing so becomes prohibitively difficult and expensive. Hence, for practical purposes, it is impossible.

To get an idea of the power of higher level optimizations, consider the type of speed up you'll get by replacing a bubblesort algorithm with quicksort. A low level optimization done by hardware (like those in a JPU) focuses on something near the instruction level, like improving the compare instruction in the bubblesort loop. A high level optimization would be to replace the sort algorithm altogether with a quicksort. The quicksort algorithm achieves exactly the same results, but is faster because it takes an entirely different route that eliminates all the redundant work in a bubblesort.

Now, a JIT isn't quite so advanced as to be able to arbitrarily rewrite your algorithms for you. It is still quite possible to write bad code that will yield bad performance even when run with a JIT. There is no substitute for good engineering (which I hope is why we all still have jobs). However, this sorting example does give you an idea of the benefits of higher level optimizations vs low level ones.

Hardware isn't slow! Software makes it slow! ... Or does it?

But before we go blaming software or the Java platform for performance problems, let's get some proper perspective. Here's an analogy to illustrate this:

A CPU is like a car engine. If it's not hooked up to anything, it can do a very high number of rotations per minute (RPM). Blinding speeds, and great power. However, this raw power doesn't produce any useful work that a human user would normally care about.

Next, we add assembly code. To the car engine, you hook up a minimal frame with a front wheel with a steering rod attached to it. To the engine itself, you attach some wheels. Now, you have something that can do useful work ... take a human from point A to point B.
It can probably run very fast too (not much extra weight to carry around). This infrastructure attached to the engine is simple and minimalistic. It is also dangerous and error prone. Beware of it exploding in your face.

Stepping it up a notch, we have C code (i.e. high level assembly). This time, you have a box car frame, a gas pedal, a gas throttle, a gear box, ... mechanisms to help the driver better control the engine. The problem is that these mechanisms are not highly automated. It is still too easy for the human driver to make mistakes, e.g. stepping on too much gas, being too rough on the gear shifting, etc. The vehicle is more user friendly but is still close to the metal. The driver gets a lot of low level control over how to divert the power of the engine. A very skilled driver can harness the power of the engine very efficiently. A less experienced driver (or even a very skilled driver on a bad day) can run the car into the ground, or slam it into a wall.

Further up the chain still, we have the Java platform. At this level, you get the cruise control, automatic gear shift, anti-lock brakes, fuel injection system, etc. With this, the driver can focus more on charting the trip he/she wants to make, and less on how to control the fuel system, gear shifting, and power efficiency of the car. The advanced infrastructure that comes with this car takes care of all that. This advanced infrastructure is the Java VM and its execution unit (a JIT or a JPU, or both).

If your focus of performance is on how many RPMs or how much maximum torque the engine can yield, then yes, you are correct about software slowing down the performance of the engine. However, if your focus is on performance attained in doing useful work for a human being, then the problem of getting performance is not about increasing the engine RPM or maximum torque. It is more about reducing the inefficiencies that occur when you attach the surrounding infrastructure to your engine.
This is the type of performance that I have been talking about.

If you are already a user of the Java platform, chances are you already know about all the benefits that it gives you (and the costs as well). I won't go into those here. What remains is how you can get greater performance for the work your code needs to do.

The JIT and JPU are two infrastructure subsystems that harness the power of the underlying hardware. The JPU approach happens to be implemented more in hardware. However, both harness the power of the underlying computational facilities of a CPU of some sort. In the case of the JPU, its efficiency is not very high because of its stack-centric nature, and its inability to optimize high level work that is commonly done by a Java VM. The JIT, on the other hand, is highly efficient because it is, by its very nature, designed to overcome the very limitations that hurt the JPU's performance. Hence, in this case, software (the JIT) is faster than hardware (the JPU). And, no, the JIT isn't defying physics. It is just making better use of it.

Final Thoughts

That said, if you are comparing a JPU with more advanced CPU features (e.g. higher clock speed, more internal cache, superscalar vs scalar, etc.) than your current general purpose CPU, then the JPU may be the better solution for you. This is, of course, assuming you can replace your board level hardware design, your memory budget is extremely tight, and the JPU costs less than a general purpose CPU with comparable hardware features. This makes sense in some cases. However, in most cases, the general purpose CPU will be the better choice because of other (non-performance related) reasons like greater availability of OSes (e.g. Linux), device drivers, and supported third party software libraries, and also because of the need to support legacy code / applications, which are usually written in C. If the Java VM must co-exist with these, the business case for the JPU would be less compelling.
Anyway, that is beyond the scope of this discussion.

I hope that this article has made the performance issue easier to understand from an intuitive sense. Till the next one, have a nice day. :-)

Tags: phoneME Advanced, CVM, CDC, JIT, Java processor, software engineering, performance