CVM's JIT: Another BIG Picture
In my last few entries, I've been talking about a bug I'm currently fixing. One of the reasons I haven't been updating daily is that said bug is taking a lot more of my time than expected. There is always more to the picture than meets the eye. Anyway, in my last entry, I briefly discussed the internals of the CVM (phoneME Advanced VM)'s JIT (officially, the dynamic adaptive compiler). Since the bug that I need to fix involves adding functionality to the JIT, we need to know in greater detail how the JIT works (or at least be able to know our way around the code). So today, we'll leave the bug fix alone for a while, and talk about the JIT's BIG Picture ...
Click on the map to get a popup window with a 1024 x 768 res bitmap of the map (if you want to view it in a separate window). Or click here to view the map in a PDF file. I highly recommend using the PDF if you plan to do a printout of the map.
And here's how to read the map ...
Start at the Beginning!
First of all, take a step back and notice the color scheme. Ignoring the purple blocks for now, there are essentially 5 column blocks in the picture going from left to right: 3 blue-ish columns, and 2 yellow-ish columns (with some orange and red) sandwiched between them.
The blue columns represent data items, and the yellow columns represent the JIT code. Note the big fat arrows that chain these columns together. The fat arrows show the JIT compilation process in action.
On the far left column, we start with the method bytecodes of a method that we want to compile. The bytecodes are passed into the 2nd column, the JIT front-end. The JIT front-end generates an intermediate representation (IR) of the method. Next, the IR is passed to the JIT back-end, which generates the machine code we see in the 5th column. In summary:
bytecode -> JIT Front-End -> IR -> JIT Back-End -> machine code
To Compile, or Not to Compile
But before a JIT compiles a method, it must first determine if the method is hot, i.e. worthy of being compiled. This hotness is determined by the JIT Compilation Policy. CVM's compilation policy is counter-based. Each method is assigned a counter (see invokeCostX in CVMMethodBlock in classes.h). The X suffix in the field name of invokeCostX means that this field should not be accessed directly. There should be an accessor macro or function provided for accessing it. In this case, the accessors are CVMmbInvokeCostSet() and CVMmbInvokeCost() (also defined in the same file). This is a common convention you will see used in many of CVM's data structures.
At initialization time, the method counter is set to a value called the cost limit (aka the climit). This value represents the "maximum" cost that the VM is willing to spend on interpreting this method. Each time we enter the method, we decrement the counter (i.e. the allowable invoke cost) by the interpretation cost (aka the icost). Each time the method does a backwards branch, the counter is decremented by the branch cost (aka the bcost). Each time the method invokes a compiled method or is invoked by a compiled method, the counter is decremented by the ccost (... ummm ... the compiled-to-interpreter-crossing cost?).
Note: these values can be set at the CVM command line using -Xjit:climit=<value>,icost=<value>,ccost=<value>,bcost=<value>.
The values can be specified together or individually, each delimited by a ','. Legal values are in the range from 0 to 65535. To give you an idea of realistic values that work well, the current defaults (as of this writing) are climit 20000, icost 20, ccost 50, bcost 4. The defaults are defined in jit_common.h.
Before a method gets hot (i.e. the counter hasn't reached 0 yet), it is executed by the interpreter. While doing this execution, the interpreter takes care of decrementing the counter and checking to see if it has reached 0. When the counter reaches 0, the interpreter will tell the JIT to compile the method. Look in executejava_standard.c and jit_common.c here for the interpreter loop and some JIT helper functions called by the interpreter.
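The counter mechanics described above can be sketched roughly as follows. Note that the type, field, and function names here are simplified stand-ins for illustration, not CVM's actual API; only the default cost values are taken from the text:

```c
#include <stdbool.h>

/* Default cost values as described above (settable via -Xjit:...). */
enum { CLIMIT = 20000, ICOST = 20, BCOST = 4, CCOST = 50 };

/* Hypothetical stand-in for the relevant part of CVMMethodBlock. */
typedef struct {
    int invokeCost;   /* counts down from CLIMIT toward 0 */
} Method;

static void method_init(Method *mb) {
    mb->invokeCost = CLIMIT;  /* max cost we'll spend interpreting this method */
}

/* Charge a cost against the method's counter (on entry: ICOST; on a
 * backwards branch: BCOST; on a compiled/interpreted crossing: CCOST).
 * Returns true when the counter hits 0, i.e. the method is now hot. */
static bool charge_cost(Method *mb, int cost) {
    mb->invokeCost -= cost;
    if (mb->invokeCost <= 0) {
        mb->invokeCost = 0;
        return true;   /* time to ask the JIT to compile this method */
    }
    return false;
}
```

With the default values, a straight-line method that is simply invoked over and over would become hot after roughly 20000 / 20 = 1000 invocations; backwards branches and compiled-code crossings get it there sooner.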
Kicking It Off
To compile the method, the interpreter calls CVMJITcompileMethod() in jitcompile.c. CVMJITcompileMethod() sets up the CVMJITCompilationContext (top blue box, second column). The compilation context is the root data structure used to track the compilation process. Some of the things that are tracked include the method IR that is generated, and other data structures which are allocated from scratch working memory. The JIT's working memory is allocated in chunks using malloc, and is freed altogether after the compilation process is complete. The amount of working memory the JIT is allowed to use is capped by the CVM command line option -Xjit:maxWorkingMemorySize=<value>. The current default limit is 550 KB.
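The "allocate in chunks, free everything at once" pattern described above is a classic arena allocator. Here is a minimal sketch of the general idea (this is not CVM's actual allocator code, just an illustration of the technique):

```c
#include <stdlib.h>
#include <stddef.h>

/* One chunk of working memory; chunks form a singly linked list. */
typedef struct Chunk {
    struct Chunk *next;
    size_t used, cap;
    char data[4096];
} Chunk;

typedef struct {
    Chunk *head;   /* most recently allocated chunk */
} Arena;

/* Carve n bytes out of the current chunk, malloc'ing a new chunk
 * when the current one is full (n must fit in one chunk here). */
static void *arena_alloc(Arena *a, size_t n) {
    if (a->head == NULL || a->head->used + n > sizeof(a->head->data)) {
        Chunk *c = malloc(sizeof(Chunk));
        if (c == NULL) return NULL;
        c->next = a->head;
        c->used = 0;
        a->head = c;
    }
    void *p = a->head->data + a->head->used;
    a->head->used += n;
    return p;
}

/* Release all chunks at once, as CVM does when compilation finishes. */
static void arena_free_all(Arena *a) {
    while (a->head != NULL) {
        Chunk *next = a->head->next;
        free(a->head);
        a->head = next;
    }
}
```

The appeal for a JIT is that compilation allocates many small, short-lived objects (IR nodes, lists, maps); individual free() calls are unnecessary since everything dies together when the compile completes.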
Bytecode To IR
CVMJITcompileMethod() calls CVMJITcompileBytecodeToIR() in jitir.c to generate the IR from the method bytecode.
The IR generator may call into an IR optimizer that re-writes the IR in an additional optimization pass. By default, the IR generator itself will already perform a lot of optimizations. A few examples are null-pointer check elimination, local common sub-expression elimination, and method inlining.
To see some examples of generated IR, you can build CVM with the build options CVM_TRACE_JIT=true or CVM_DEBUG=true, and then run with the command line option -Xjit:trace=bctoir. Note that you will want to run a more substantial program than HelloWorld. Otherwise, you won't see any IR trace output. This is because HelloWorld doesn't run long enough to be compiled.
In the IR, a method is structured as a list of blocks. Each block represents an extended basic block in the method bytecode (i.e. single entry at the top of the block, but multiple exits via branches or fall thru at the end). Each block contains a list of DAGs (directed acyclic graphs) which define statements and expressions. Each node in the DAGs has an opcode and sometimes attributes, flags, and decorations that provide additional information about the operation that the node represents. The operands of the node are other nodes that hang off the node in the DAG tree. These node opcodes are later used to cue the JIT back-end in the code generation phase.
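The shape of this IR can be sketched in C roughly like this. These are illustrative type and field names only, not CVM's real IR declarations:

```c
#include <stddef.h>

/* Illustrative sketch of the IR shape described above. */
typedef struct IRNode {
    int opcode;                   /* cues the back-end during codegen */
    unsigned flags;               /* attributes / decorations */
    struct IRNode *operands[2];   /* child nodes hanging off this one */
} IRNode;

typedef struct IRRoot {           /* one statement: the root of a DAG */
    IRNode *expr;
    struct IRRoot *next;
} IRRoot;

typedef struct IRBlock {          /* one extended basic block */
    IRRoot *firstRoot;            /* list of DAG roots (statements) */
    struct IRBlock *next;         /* the blocks form a list covering
                                     the whole method */
} IRBlock;
```

One property worth noticing: because statements are DAGs rather than trees, a common sub-expression can appear once and be referenced from multiple parents, which is exactly what the common sub-expression elimination mentioned above produces.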
Note: the entire JIT front-end is 100% shared code (denoted by the yellow boxes) that is common between all target platforms. The majority of JIT optimizations are done here, thereby benefiting all target platforms. The IR also encapsulates most semantics of Java code. As a result, implementors of the JIT back-end do not need as much knowledge about Java VM semantics. For example, the IR will specify where null pointer checks need to be done. The back-end doesn't need to have this knowledge.
The completed IR is passed to the JIT back-end via the compilation context. CVMJITcompileMethod() invokes the back-end via a call to CVMJITcompileGenerateCode() in jit_risc.c.
Note that the JIT back-end is, by definition, target platform specific. However, CVM provides a RISC JIT layer (see orange boxes) that is common for all target CPU architectures that are RISC-like. RISC, in this case, refers to the common use of the term, meaning CPUs with features like uniform length instructions, a relatively large register file, and a reduced instruction set. Some examples of CPU architectures that meet this specification include ARM, MIPS, PowerPC, and Sparc. There are other CPU architectures that are called RISC but do not have some of the above features. For such architectures, the RISC layer may require significantly more customization or may not work at all. At the very least, the RISC layer code can be used as a reference for other JIT implementations. For example, the x86 JIT back-end was originally based on this RISC layer even though the CPU architecture is significantly different.
Mind your Grammar
jit_risc is located in src/portlibs/jit/risc and is part of this RISC layer. A big part of the RISC layer is the code generation production rules (aka grammar rules). For example, see jitgrammarrules.jcs.
The grammar rules are written in jcs files which look like yacc source. These jcs files are compiled by a tool called Java Code Select (JCS). The tool is provided as part of the CVM source (see src/share/javavm/jcs).
JCS compiles these into C code which is then compiled and linked into the JIT.
During the code generation phase, for each IR node, an opcode is fetched from the node. This opcode is then passed through an opcode mapper (see CVMJITgetMassagedIROpcode() in jitopcodes.c).
This mapper (or massager) maps similar opcodes into a common representative opcode. This allows us to reduce the number of grammar rules that need to be written. For example, boolean, byte, char, short, float, and int getfields can all be implemented using the same 32-bit getfield implementation. So, the mapper can map all these getfield opcodes into the int getfield opcode, since they would all generate identical code anyway. This in turn means that we only need a single rule to generate the code for an int getfield operation. We don't need variants for the other types listed above.
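The mapping idea can be sketched like this. The opcode names below are made up for illustration; the real mapper is CVMJITgetMassagedIROpcode() in jitopcodes.c:

```c
/* Hypothetical IR opcodes, for illustration only. */
enum Opcode {
    OP_GETFIELD_BOOLEAN, OP_GETFIELD_BYTE, OP_GETFIELD_CHAR,
    OP_GETFIELD_SHORT, OP_GETFIELD_FLOAT, OP_GETFIELD_INT,
    OP_GETFIELD_LONG
};

/* Collapse all 32-bit getfield variants onto the int variant, so one
 * grammar rule can cover them all. */
static enum Opcode massage_opcode(enum Opcode op) {
    switch (op) {
    case OP_GETFIELD_BOOLEAN:
    case OP_GETFIELD_BYTE:
    case OP_GETFIELD_CHAR:
    case OP_GETFIELD_SHORT:
    case OP_GETFIELD_FLOAT:   /* same 32-bit load as int */
        return OP_GETFIELD_INT;
    default:
        return op;            /* e.g. a 64-bit long getfield stays distinct */
    }
}
```

The payoff is in the grammar: instead of six nearly identical getfield rules, one rule matches the representative opcode and the mapper funnels everything into it.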
Like the IR, the grammar rules also need to encapsulate some Java VM semantics. For example, they need to know when exceptions can be thrown so that they can save the thread stack state before the exception potentially gets thrown. This is necessary since exception handling relies on stack information. Another example is the capture of stackmaps. The grammar rules need to know when garbage collection (GC) can occur, and make sure to capture a stackmap at that point. The stackmaps are necessary in order for the GC to do its work. Since these actions are related to code generation, there's no way for the JIT front-end to relieve the back-end of this burden.
Note: the grammar rules are synonymous with the code generator (aka codegen) since they work as the driver mechanism for generating code. To see some examples of codegen output, build CVM with build options CVM_TRACE_JIT=true or CVM_DEBUG=true, and run with the command line option -Xjit:trace=codegen.
Again, make sure to run a more substantial program than HelloWorld. Otherwise, you won't see any codegen trace output due to HelloWorld not running long enough to be compiled.
Other components in the RISC layer include a stack manager which tracks the operand stack semantics for the code generator, a register manager that tracks register usage in the RISC register file, and a stackmap capture mechanism that captures the method stackmap at various points in code execution when requested by the codegen.
Emitters (short for instruction emitters) are functions that are called by the grammar rules to write out the encoding of CPU instructions. The RISC layer defines a RISC porting layer that includes a set of emitters.
When porting to a new CPU architecture, these emitter functions need to be written. Emitter functions are mostly CPU specific code. There are some "complex" emitters provided by the RISC layer that perform the semantics of emitting instructions but actually do their work by calling into one or more of the CPU specific emitters.
To write CPU specific emitters, one only needs to know the semantics that the RISC layer expects of the instruction, as well as the CPU specific instruction encodings. Emitters usually emit single instructions, but may occasionally emit more than one instruction to implement the semantics if the CPU does not have a native instruction that will do the job.
For an example of emitters, see the ARM version of jitemitter_cpu.c.
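The general shape of an emitter is: take operands, encode an instruction word, append it to the code buffer. Here's a toy sketch for a made-up RISC encoding, not the real ARM emitter or CVM's actual emitter API:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal code buffer standing in for the JIT's output stream. */
typedef struct {
    uint32_t code[256];
    size_t count;
} CodeBuf;

/* Emit a register-register ADD for a made-up encoding:
 * opcode in bits 31..26, rd in 25..21, rs1 in 20..16, rs2 in 15..11. */
static void emit_add(CodeBuf *cb, int rd, int rs1, int rs2) {
    uint32_t instr = ((uint32_t)0x20 << 26) |
                     ((uint32_t)rd  << 21) |
                     ((uint32_t)rs1 << 16) |
                     ((uint32_t)rs2 << 11);
    cb->code[cb->count++] = instr;
}

/* A multi-instruction emitter: materialize a 32-bit constant with two
 * instructions, for a CPU with no single instruction that can load a
 * full 32-bit immediate. */
static void emit_load_const32(CodeBuf *cb, int rd, uint32_t value) {
    cb->code[cb->count++] = ((uint32_t)0x21 << 26) |
                            ((uint32_t)rd << 21) |
                            (value >> 16);          /* load upper half */
    cb->code[cb->count++] = ((uint32_t)0x22 << 26) |
                            ((uint32_t)rd << 21) |
                            (value & 0xFFFF);       /* OR in lower half */
}
```

emit_load_const32() illustrates the point made above: the caller asks for one logical operation, and the emitter quietly expands it into however many native instructions the CPU requires.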
Have it Your Way!
The RISC grammar rules can be overridden by CPU specific grammar rules. For example, see the ARM specific version of the grammar rules. This overriding allows the codegen to make use of CPU specific features (e.g. special instructions) to implement certain operations more efficiently.
Overriding does not mean that all the RISC rules are replaced at once. Instead, the CPU specific rules are added to the RISC rules. But through the use of a cost value system, JCS will select the CPU specific rules over the RISC rules, thereby giving the effect of rule overriding. If the CPU specific version does not provide alternatives for some rules, the corresponding RISC version will be used as before.
This same overriding mechanism is used to add rules that will generate code for a hardware floating point unit. If an FPU is available for the target CPU, the float grammar rules can be added into the build, and JCS will choose the hardware implementation over the default soft float ones.
Note that this overriding mechanism essentially gives the JIT porting engineer the ability to layer on more back-end optimizations over time. In the initial stages of porting, the implementation can go with the default RISC implementation provided. As time permits, custom rules that make use of custom emitters, or implement different algorithms can be added on to squeeze more performance out of the compiled code.
When the compilation process is done, the resultant compiled code (for the method) is stored in the JIT code cache. The IR and the rest of the scratch working memory that was allocated during the compilation process can now be released back to the system.
Once the method is compiled, the next time the method is invoked, the interpreter will dispatch to the compiled code instead of continuing to interpret it. The mechanism for dispatching to compiled code was discussed in a previous article here.
If we're already in the middle of executing the method and the compilation was triggered by a long running loop, then the on-stack replacement (OSR) mechanism will be employed to replace the interpreter frame of the method with its equivalent compiled frame. Thereafter, execution will continue in the compiled method instead of the interpreted one. A discussion of the OSR mechanism is available here in a previous article.
Not all bytecodes have native counterparts in the target CPU instruction set. For example, object allocation and synchronization bytecodes can only be implemented as calls to runtime helper functions (aka CCM helpers). CVM provides C implementations for these helpers that are part of shared code. However, some assembler glue code is necessary for 2 reasons: 1. for bootstrapping into compiled code from the interpreter, and 2. for simplifying the setup of arguments and state for calling the C helpers. This assembler glue is obviously CPU specific.
During code execution, control of the CPU will transfer back and forth between the compiled code, the helpers, as well as to the interpreter and other VM runtime library functions. However, entry into and exit from the compiled code always goes through the CCM helpers. Another aspect of compiled code execution, the effects on the runtime stacks, is discussed in a previous article.
And when you come to the End, STOP!
To recap, CVM's JIT has many components. In the above picture, I group all these components into yellowish boxes (with some orange and red). The yellow boxes represent shared code that is common to all target platforms. The orange boxes represent the RISC JIT layer that is common to all RISC architectures. The red boxes are the CPU specific components that must be implemented for a target architecture.
If you take a step back, and just look at the boxes of yellow, orange, and red which make up the CVM JIT, you'll see that very few parts are actually CPU specific (red boxes). If your CPU architecture isn't RISC-like (as defined above), then you have the option of re-writing the entire JIT back-end though it is not advised. Rewriting the entire back-end is a daunting task. Most people would opt wisely to adapt the RISC layer for their needs instead. Regardless, this, of course, still takes more work than if the CPU architecture is RISC-like to begin with.
As mentioned above, a central design philosophy behind the CVM JIT is that we try to keep as much of the Java VM semantics as possible in shared code. There is a need to implement some of these semantics in the production rules layer. The RISC implementation layer encapsulates the remaining bits of these semantics. After that, all that remains are the instruction emitters, which do not need to know anything about Java semantics. In order to simplify the porting effort (and reduce the chance of semantic bugs being introduced into the VM), the JIT is structured so that the porting engineer can focus mostly on the emitters. As a result, CVM's JIT has proven to be easily portable.
On the other hand, the ability to override default implementations in the JIT allows the more sophisticated porting engineer to implement more advanced optimizations to gain additional performance, but only as their development schedule permits.
Note also that this article describes CVM's JIT. Not all JITs are structurally the same. For example, the phoneME Feature VM's JIT is very different from this one.
Next entry, I'll resume the bug fix, which will take us step by step through each stage of the JIT in great detail. We will dig into the code of the IR generator amongst other things.
Till then, have a nice day. :-)