Mark Lam's Blog

CVM's JIT: Another BIG Picture

Posted by mlam on January 11, 2007 at 04:33 AM

In my last few entries, I've been talking about a bug I'm currently fixing. One of the reasons I haven't been updating daily is that said bug is taking a lot more of my time than expected. There is always more to the picture than meets the eye.

Anyway, in my last entry, I briefly discussed the internals of the CVM (phoneME Advanced VM)'s JIT (officially, the dynamic adaptive compiler). Since the bug that I need to fix involves adding functionality to the JIT, we need to know in greater detail how the JIT works (or at least be able to find our way around the code). So today, we'll leave the bug fix alone for a while, and talk about the JIT's BIG Picture ...

Click on the map to get a popup window with a 1024 x 768 res bitmap of the map (if you want to view it in a separate window). Or click here to view the map in a PDF file. I highly recommend using the PDF if you plan to do a printout of the map.

And here's how to read the map ...

Start at the Beginning!

The blue columns represent data items, and the yellow columns represent the JIT code. Note the big fat arrows that chain these columns together. The fat arrows show the JIT compilation process in action. In the far left column, we start with the bytecodes of a method that we want to compile. The bytecodes are passed into the 2nd column, the JIT front-end. The JIT front-end generates an intermediate representation (IR) of the method. Next, the IR is passed to the JIT back-end, which generates the machine code we see in the 5th column. In summary:

bytecode -> JIT Front-End -> IR -> JIT Back-End -> machine code

To Compile, or Not to Compile

At initialization time, the method counter is set to a value called the cost limit (aka the climit). This value represents the "maximum" cost that the VM is willing to spend on interpreting this method. Each time we enter the method, we decrement the counter (i.e.
the allowable invoke cost) by the interpretation cost (aka the icost). Each time the method does a backwards branch, the counter is decremented by the branch cost (aka the bcost). Each time the method invokes a compiled method or is invoked by a compiled method, the counter is decremented by the ccost (... ummm ... the compiled-to-interpreter-crossing cost?).

Note: these values can be set at the CVM command line using -Xjit:climit=<value>,icost=<value>,ccost=<value>,bcost=<value>. The values can be specified together or individually, each delimited by a ','. Legal values are in the range from 0 to 65535. To give you an idea of realistic values that work well, the current defaults (as of this writing) are climit 20000, icost 20, ccost 50, bcost 4. With these defaults, a method that never loops and never crosses into or out of compiled code uses up its budget after 20000 / 20 = 1000 interpreted invocations. The defaults are defined in jit_common.h.

Before a method gets hot (i.e. the counter hasn't reached 0 yet), it is executed by the interpreter. While doing so, the interpreter takes care of decrementing the counter and checking whether it has reached 0. When the counter reaches 0, the interpreter will tell the JIT to compile the method. Look in executejava_standard.c and jit_common.c for the interpreter loop and some JIT helper functions called by the interpreter.

Kicking It Off

Bytecode To IR

The IR generator may call into an IR optimizer that re-writes the IR in an additional optimization pass. By default, the IR generator itself already performs a lot of optimizations. A few examples are null-pointer check elimination, local common sub-expression elimination, and method inlining.

To see some examples of generated IR, you can build CVM with the build options CVM_TRACE_JIT=true or CVM_DEBUG=true, and then run with the command line option -Xjit:trace=bctoir. Note that you will want to run a more substantial program than HelloWorld. Otherwise, you won't see any IR trace output. This is because HelloWorld doesn't run long enough to be compiled.

In the IR, a method is structured as a list of blocks.
Each block represents an extended basic block in the method bytecode (i.e. single entry at the top of the block, but multiple exits via branches or fall-through at the end). Each block contains a list of DAGs (directed acyclic graphs) which define statements and expressions. Each node in the DAGs has an opcode and sometimes attributes, flags, and decorations that provide additional information about the operation that the node represents. The operands of the node are other nodes that hang off the node in the DAG. These node opcodes are later used to cue the JIT back-end in the code generation phase.

Note: the entire JIT front-end is 100% shared code (denoted by the yellow boxes) that is common between all target platforms. The majority of JIT optimizations are done here, thereby benefiting all target platforms. The IR also encapsulates most semantics of Java code. As a result, implementors of the JIT back-end do not need as much knowledge about Java VM semantics. For example, the IR will specify where null pointer checks need to be done. The back-end doesn't need to have this knowledge.

The Back-End

Note that the JIT back-end is, by definition, target platform specific. However, CVM provides a RISC JIT layer (see orange boxes) that is common for all target CPU architectures that are RISC-like. RISC, in this case, refers to the common use of the term, which means CPUs with features like uniform length instructions, a relatively large register file, and a reduced instruction set. Some examples of CPU architectures that meet this specification include ARM, MIPS, PowerPC, and SPARC. There are other CPU architectures that are called RISC but do not have some of the above features. For such architectures, the RISC layer may require significantly more customization or may not work at all. At the very least, the RISC layer code can be used as a reference for other JIT implementations.
For example, the x86 JIT back-end was based originally on this RISC layer even though the CPU architecture is significantly different.

Mind your Grammar

During the code generation phase, for each IR node, an opcode is fetched from the node. This opcode is then passed through an opcode mapper (see CVMJITgetMassagedIROpcode() in jitopcodes.c). This mapper (or massager) maps similar opcodes into a common representative opcode. This allows us to reduce the number of grammar rules that need to be written. For example, boolean, byte, char, short, float, and int getfields can all be implemented using the same 32-bit getfield implementation. So, the mapper can map all these getfield opcodes into the int getfield opcode since they would all generate identical code anyway. This in turn means that we only need a single rule to generate the code for an int getfield operation. We don't need variants for the other types specified above.

Like the IR, the grammar rules also need to encapsulate some Java VM semantics. For example, they need to know when exceptions can be thrown so that they can save the thread stack state before the exception potentially gets thrown. This is necessary since the exception handling relies on stack information. Another example is the capture of stackmaps. The grammar rules need to know when garbage collection (GC) can occur, and make sure to capture a stackmap at that point. The stackmaps are necessary in order for the GC to do its work. Since these actions are related to code generation, there's no way for the JIT front-end to relieve the back-end of this burden.

Note: the grammar rules are synonymous with the code generator (aka codegen) since they work as the driver mechanism for generating code.

To see some examples of codegen output, build CVM with build options CVM_TRACE_JIT=true or CVM_DEBUG=true, and run with the command line option -Xjit:trace=codegen. Again, make sure to run a more substantial program than HelloWorld.
Otherwise, you won't see any codegen trace output due to HelloWorld not running long enough to be compiled.

RISCy Business

Emit Who?

When porting to a new CPU architecture, these emitter functions need to be written. Emitter functions are mostly CPU specific code. There are some "complex" emitters provided by the RISC layer that perform the semantics of emitting instructions but actually do their work by calling into one or more of the CPU specific emitters. To write CPU specific emitters, one only needs to know the semantics that the RISC layer expects of the instruction, as well as the CPU specific instruction encodings. Emitters usually emit single instructions, but may occasionally emit more than one instruction to implement the semantics if the CPU does not have a native instruction that will do the job. For an example of emitters, see the ARM version of jitemitter_cpu.c.

Have it Your Way!

Overriding does not mean that all the RISC rules are replaced at once. Instead, the CPU specific rules are added to the RISC rules. But through the use of a cost value system, JCS will select the CPU specific rules over the RISC rules, thereby giving the effect of rule overriding. If the CPU specific version does not provide alternatives for some rules, the corresponding RISC version will be used as before.

This same overriding mechanism is used to add rules that will generate code for a hardware floating point unit. If an FPU is available for the target CPU, the float grammar rules can be added into the build, and JCS will choose the hardware implementation over the default soft float ones.

Note that this overriding mechanism essentially gives the JIT porting engineer the ability to layer on more back-end optimizations over time. In the initial stages of porting, the implementation can go with the default RISC implementation provided.
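
To picture the cost-based overriding described above, here is a minimal sketch in C. It is not JCS or CVM code: the Rule struct, the opcode names, the cost numbers, and the selectRule() and ruleCost() functions are all made up for illustration, and the sketch assumes that the lowest-cost matching rule is the one that wins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IR opcodes; the real ones are defined by CVM's JIT. */
typedef enum { OP_IADD, OP_FADD, OP_LMUL } Opcode;

/* One grammar rule: the opcode it matches, its cost, and where it came from. */
typedef struct {
    Opcode      opcode;
    int         cost;     /* assumption: lower cost is preferred */
    const char *origin;   /* "risc" = shared default, "cpu" = CPU specific */
} Rule;

/* CPU specific rules are added next to the RISC rules, not swapped in. */
const Rule rules[] = {
    { OP_IADD, 10, "risc" },  /* default integer add, never overridden */
    { OP_FADD, 40, "risc" },  /* default soft-float add */
    { OP_FADD,  5, "cpu"  },  /* added rule: hardware FPU add */
};

/* Return the cheapest rule matching op, or NULL if no rule matches. */
const Rule *selectRule(Opcode op) {
    const Rule *best = NULL;
    size_t n = sizeof(rules) / sizeof(rules[0]);
    for (size_t i = 0; i < n; i++) {
        if (rules[i].opcode != op) {
            continue;
        }
        if (best == NULL || rules[i].cost < best->cost) {
            best = &rules[i];
        }
    }
    return best;
}

/* Convenience wrapper for opcodes that are known to have a rule. */
int ruleCost(Opcode op) {
    const Rule *r = selectRule(op);
    assert(r != NULL);
    return r->cost;
}
```

In this toy table, selectRule(OP_FADD) returns the added "cpu" rule because it is cheaper, while selectRule(OP_IADD) still returns the "risc" rule because nothing was added to compete with it.
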
As time permits, custom rules that make use of custom emitters or implement different algorithms can be added to squeeze more performance out of the compiled code.

The Output

Once the method is compiled, the next time the method is invoked, the interpreter will dispatch to the compiled code instead of continuing to interpret it. The mechanism for dispatching to compiled code was discussed in a previous article. If we're already in the middle of executing the method and the compilation was triggered by a long running loop, then the on-stack replacement (OSR) mechanism will be employed to replace the interpreter frame of the method with its equivalent compiled frame. Thereafter, execution will continue in the compiled method instead of the interpreted one. A discussion of the OSR mechanism is available in a previous article.

The Help

The C helpers can be found in ccm_runtime.c. Examples of assembler glue for the ARM port can be found in ccmglue_cpu.S, ccmallocators_cpu.S and ccminvokers_cpu.S.

During code execution, control of the CPU will transfer back and forth between the compiled code, the helpers, as well as the interpreter and other VM runtime library functions. However, entry into and exit from the compiled code always goes through the CCM helpers. Another aspect of compiled code execution, the effects on the runtime stacks, is discussed in a previous article.

And when you come to the End, STOP!

If you take a step back, and just look at the boxes of yellow, orange, and red which make up the CVM JIT, you'll see that very few parts are actually CPU specific (red boxes). If your CPU architecture isn't RISC-like (as defined above), then you have the option of re-writing the entire JIT back-end, though it is not advised. Rewriting the entire back-end is a daunting task. Most people would wisely opt to adapt the RISC layer for their needs instead.
Regardless, this, of course, still takes more work than if the CPU architecture is RISC-like to begin with.

As mentioned above, a central design philosophy behind the CVM JIT is that we try to keep as much of the Java VM semantics as possible in the shared code. There is a need to implement some of these semantics in the production rules layer. The RISC implementation layer will encapsulate the remaining bits of these semantics. After that, all that remains are the instruction emitters, which do not need to know anything about Java semantics. In order to simplify the porting effort (and reduce the chance of semantic bugs being introduced into the VM), the JIT is structured so that the porting engineer can focus mostly on the emitters. As a result, CVM's JIT has proven to be easily portable.

On the other hand, the ability to override default implementations in the JIT allows the more sophisticated porting engineer to implement more advanced optimizations to gain additional performance, but only as their development schedule permits.

Note also that this article describes CVM's JIT. Not all JITs are structurally the same. For example, the phoneME Feature VM's JIT is very different from this.

What's Next?

Till then, have a nice day. :-)