CVM JIT Constant Pool Dumps
Hello World! It's been a long time ... ummm ... like 6 months since I last wrote an entry. What can I say? That's the problem with having a day job, and so far, all the ideas for things that I want to write about involve some heavy duty writing that will take up a lot of time. So, I've been putting it off. Sorry.
However, this inquiry came in today on one of my previous blog entries. Now, this, I can answer without taking up a few days of writing time. So, here you are ...
On March 12, 2008, Jamsheed Mohammed asks ...
"hi lam, Why in the JIT constant pool is the last accessed constant first and the first accessed constant emitted last, while the other way around would be a more efficient usage of ARM architectural limitation (PC relative load limitation)?"
I took some liberty with editing the comment for clarity. Jamsheed, I hope you don't mind.
What's a Constant Pool?
Anyone who has written code knows that you will often need some constants in the course of writing it. This is no different for code written in the Java programming language. These Java constants are stored in a data structure called the Constant Pool, which is part of the Java classfile. The constant pool that Jamsheed is asking about is not that constant pool. This might be obvious to some people, but just in case, it's better to be clear.
Instead, Jamsheed is referring to constants that are referenced by code that the CVM JIT compiler generates. These constants may be the same values that are fetched from the classfile constant pool, but that's beside the point.
Now, there will be more than one of these constants used in the JIT generated code. So, instead of spreading them all over the generated code, we "pool" them together i.e. we keep them in one (or a few) places. We call these places (where we pool the constants) the constant pools. This is the constant pool that Jamsheed was referring to.
Why Pool Constants?
Some CPUs (like the ARM) have separate instruction and data caches. These are commonly referred to as the i-cache and d-cache respectively. When loading data from memory, the data first gets cached in the d-cache before it is accessible by the CPU. When loading instructions, the instruction first gets cached in the i-cache instead. So, data gets stored in the d-cache, and code gets stored in the i-cache. And constants are ... data!
Now consider what happens when you have your constants spread out, interlaced between all the JIT generated code. When you execute code that is located adjacent to a constant, the constant may get loaded into the i-cache simply because it is located near the needed code.
The cache manager doesn't actually know if the bits in memory are code or data. It just loads a few words of memory at a time into the respective cache. If those few words include a constant in the midst of code being executed, the constant will get loaded into the i-cache as well. When this happens, some space in the i-cache will be taken up for something (i.e. the constant) that will never be executed as code. This is an inefficient use of the limited and precious i-cache space.
Similarly, when loading the constant as data, the cache manager will load a few words of memory around the constant into the d-cache. If all the words around the constant are actually code and not data, then the d-cache will now contain wasted space for words that will never be used as data.
The end result of all this is less cache locality, and that means that the code will run slower. By pooling the constants together, we lessen the probability of these kinds of i-cache and d-cache inefficiencies occurring. :)
But I digress. Now, let's get back to Jamsheed's question ...
The ARM Instruction Set
In his question, Jamsheed mentioned the ARM architecture. The reason is that in the ARM instruction set, load instructions can only load from an address that is located within approximately 4K of the PC of the load instruction itself. Let's see what this means ...
Let's say you (i.e. the JIT) just emitted a load instruction to load a constant. Because you want to pool the constants together, you don't actually emit the constant yet. Instead, you keep a record of where the load instruction was emitted, and later on when you emit the constant (and therefore, finally know where it is located), you'll come back and fix up this load instruction with the proper offset for that constant.
The question is ... how do you know when to actually emit (also commonly referred to as "dump") the constant? If you dump it too early, then you may not be pooling as many constants as you possibly could, thereby increasing the cache inefficiency issue I described earlier. If you dump it too late, then the constant may be out of reach of the 4K range of the load instruction that needs to reach it.
The answer is to do periodic checks for a need to dump constants i.e. we'll dump them out into a pool whenever we feel that we may reach the 4K range limit soon. See CVMJITcpoolNeedDump() in src/share/javavm/runtime/jit/jitconstantpool.c.
Does the CVM JIT really do that?
Well, actually ... we don't dump exactly at the border of the 4K limit. This is because we can't arbitrarily dump whenever we like. For example, there are some code sequences that need to stick together and cannot afford to have a constant pool suddenly show up in their midst. Hence, the CVM JIT has to check for a need to dump whenever it is at a convenient place to dump. Right now, such places include branch sites, method invocation sites, and one other place ... which I'll explain later.
Hence, we can't actually wait till we reach the 4K limit before dumping the constants. Note that there's also a chance that we may have collected a large number of constants. When we dump the constants, each constant further increases the offset to the next constant. Also, if we don't dump right now, we don't know when the next opportunity to dump will show up. If it shows up too late, then we'll have a JIT compilation failure. To address this issue, the CVM JIT uses a heuristic and dumps whenever we reach a distance of 2K from the original load instruction, i.e. we trade off some cache inefficiency to make sure that we can reach the constants from the load instructions.
By now, you might be thinking ... what a dumb JIT compiler! Surely it can do something more intelligent and eke out every last bit of offset possible between the load instruction and the constant pool dump. Well, theoretically, the JIT can do that. But in this case, we're talking about a JavaME VM JIT, and it needs to be fast and efficient, i.e. the CVM JIT is not allowed to take up too much time and memory to do the compilation. The above heuristic is a cheap but effective trick that gets the job done without sacrificing too much performance. "Cheap" is good for embedded devices. :)
More CVM JIT Details
Well, actually ... the max offset range is even smaller than 4K if you are using the VFP hardware float instructions of the ARM. The VFP load instructions (for floating point constants) have an even smaller range ... something like 256 bytes, if I remember correctly. So, the cache efficiency issue will be exacerbated. Anyway, the JIT's gotta do what the JIT's gotta do.
So, remember earlier when I said that there's one other place where we can dump constants? Well, that place is in between every sequence of instructions that the JIT grammar rules may emit (with some proper code to branch around the constant pool dump, of course). There is a logical break between each sequence of instructions emitted for each JIT grammar rule where we can insert a constant pool dump. Because of the small offset range of ARM VFP constants, the CVM JIT is forced to allow dumps more frequently like this.
This doesn't necessarily mean that there will always be a constant pool dump every 128 bytes of instructions or so. It only means that when there are constants to dump, you may see them show up every 128 bytes or so in the worst case. Fortunately, our benchmark data shows that performance is not impacted by this (or at least not significantly enough to be noticed).
But I am still digressing ...
The Question again
Jamsheed was asking why the CVM JIT constants are dumped from the last one accessed to the first. This obviously increases the offset distance between the first accessed constant and its corresponding load instruction.
The Answer ... finally
Well, it's simply because we were using a linked list to track the constants, and inserting new constants at the head instead of the tail. There's no deep reason for dumping the constants this way. With very minimal work, we could change it so that the constants are dumped in forward access order rather than in the current reverse access order.
Having said that, if you take a look at the range limit issues and the heuristic that the CVM JIT has to employ to predict when the last possible opportunity to dump constants is, you may find that this optimization will have very little effect in the overall scheme of things. Yes, dumping in forward order will help. Where it could help is in allowing the CVM JIT to use a heuristic ratio that is less than half the max offset range (currently it is half). This would allow more constants to be pooled before we do a dump.
However, I'm not sure how much it will help and how much to change the heuristic ratio. That will be an interesting exercise to do when someone can find the time. As I've said when I started this entry, the day job is not leaving us a lot of time to play. :( Is anyone in the community willing to give this a try and report your findings? Of course, I can provide a few tips on what to do if you are interested.
So, Jamsheed, I hope that answers your question. Thanks for your astute observation. It gave me this opportunity to share a bit about how constant pool dumps work in the CVM JIT.
Have a nice day. :)