
CVM JIT Constant Pool Dumps

Posted by mlam on March 12, 2008 at 11:52 PM PDT

Hello World! It's been a long time ... ummm ... like 6 months since I last wrote an entry. What can I say? That's the problem with having a day job, and so far, all the ideas for things I want to write about involve some heavy duty writing that will take up a lot of time. So, I've been putting it off. Sorry.

However, this inquiry came in today on one of my previous blog entries. Now, this, I can answer without taking up a few days of writing time. So, here you are ...

The Question

On March 12, 2008, Jamsheed Mohammed asks ...

"hi lam, Why in the JIT constant pool is the last accessed constant first and the first accessed constant emitted last, while the other way around would be a more efficient usage of ARM architectural limitation (PC relative load limitation)?"

I took some liberty with editing the comment for clarity. Jamsheed, I hope you don't mind.

What's a Constant Pool?

Anyone who has written code knows that you will often need some constants along the way. This is no different for code written in the Java programming language. These Java constants are stored in a data structure called the Constant Pool, which is part of the Java classfile. The constant pool that Jamsheed is asking about is not that constant pool. This might be obvious to some people, but just in case, it's better to be clear.

Instead, Jamsheed is referring to constants that are referenced by code that the CVM JIT compiler generates. These constants may be the same values that are fetched from the classfile constant pool, but that's beside the point.

Now, there will be more than one of these constants in the JIT-generated code. So, instead of spreading them all over the generated code, we "pool" them together, i.e. we keep them in one (or a few) places. We call these places (where we pool the constants) the constant pools. This is the constant pool that Jamsheed was referring to.

Why Pool Constants?

Some CPUs (like the ARM) have separate instruction and data caches. These are commonly referred to as the i-cache and d-cache respectively. When loading data from memory, the data first gets cached in the d-cache before it is accessible by the CPU. When loading instructions, the instructions get cached in the i-cache instead. So, data goes in the d-cache, and code goes in the i-cache. And constants are ... data!

Now consider what happens when your constants are interleaved with the JIT-generated code. If that occurs, when you execute code located adjacent to a constant, the constant may get loaded into the i-cache as well, simply because it is located near the needed code.

The cache manager doesn't actually know whether the bits in memory are code or data. It just loads a few words of memory at a time into the respective cache. If those few words include a constant in the midst of code being executed, the constant will get loaded into the i-cache as well. When this happens, some space in the i-cache will be taken up by something (i.e. the constant) that will never be executed as code. This is an inefficient use of limited and precious i-cache space.

Similarly, when loading the constant as data, the cache manager will load a few words of memory around the constant into the d-cache. If all the words around the constant are actually code and not data, then the d-cache will now contain wasted space for words that will never be used as data.

The end result of all this is less cache locality, and that means that the code will run slower. By pooling the constants together, we lessen the probability of these kinds of i-cache and d-cache inefficiencies occurring. :)

But I digress. Now, let's get back to Jamsheed's question ...

The ARM Instruction Set

In his question, Jamsheed mentioned the ARM architecture. That's because in the ARM instruction set, a load instruction can only load from an address that is located within approximately 4K of the load instruction's own PC.
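To make the constraint concrete, here's a minimal sketch in C (the helper name is mine, not from the CVM sources). An ARM LDR encodes a 12-bit offset magnitude plus a sign bit, and in ARM state the PC reads as the instruction's address plus 8:

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch only: can an ARM PC-relative LDR at load_pc reach a constant
       at constant_addr? The LDR offset field holds 12 bits of magnitude
       plus a sign bit, and the PC reads as the instruction's address + 8. */
    static bool armLdrCanReach(intptr_t load_pc, intptr_t constant_addr)
    {
        intptr_t offset = constant_addr - (load_pc + 8);
        return offset >= -4095 && offset <= 4095;
    }

Now, let's see what this limitation means for the JIT ...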

Let's say you (i.e. the JIT) just emitted a load instruction to load a constant. Because you want to pool the constants together, you don't actually emit the constant yet. Instead, you keep a record of where the load instruction was emitted, and later on, when you emit the constant (and therefore finally know where it is located), you come back and fix up this load instruction with the proper offset for that constant.
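Here's a minimal sketch of that bookkeeping, with hypothetical names (the real CVM data structures differ): keep a list entry per emitted load, and patch the LDR's offset field once the constant's final address is known.

    #include <stdint.h>

    /* Hypothetical record of a constant-loading LDR awaiting its offset. */
    typedef struct ConstantRef {
        uint32_t           *loadInstr; /* address of the emitted LDR      */
        uint32_t            value;     /* the constant still to be dumped */
        struct ConstantRef *next;
    } ConstantRef;

    /* Patch the LDR at loadInstr to reference the dumped constant. The
       offset magnitude goes in the low 12 bits; bit 23 (the U bit)
       selects whether the offset is added to or subtracted from the PC. */
    static void patchLoad(uint32_t *loadInstr, uint32_t *constantAddr)
    {
        intptr_t offset = (intptr_t)constantAddr - ((intptr_t)loadInstr + 8);
        uint32_t magnitude = (offset >= 0) ? (uint32_t)offset
                                           : (uint32_t)-offset;
        *loadInstr = (*loadInstr & ~0xfffu) | (magnitude & 0xfffu);
        if (offset >= 0)
            *loadInstr |= (1u << 23);   /* U = 1: add offset to PC */
        else
            *loadInstr &= ~(1u << 23);  /* U = 0: subtract offset  */
    }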

The question is ... how do you know when to actually emit (also commonly referred to as "dump") the constant? If you dump it too early, then you may not be pooling as many constants as you could, thereby worsening the cache inefficiency issue I described earlier. If you dump it too late, then the constant may be out of reach of the 4K range of the load instruction that needs to reach it.

The answer is to do periodic checks for a need to dump constants, i.e. we'll dump them out into a pool whenever we feel that we may be approaching the 4K range limit. See CVMJITcpoolNeedDump() in src/share/javavm/runtime/jit/jitconstantpool.c.

Does the CVM JIT really do that?

Well, actually ... we don't dump exactly at the border of the 4K limit. This is because we can't arbitrarily dump whenever we like. For example, some code sequences need to stick together and cannot afford to have a constant pool suddenly show up in their midst. Hence, the CVM JIT checks for a need to dump whenever it is at a convenient place to do so. Right now, such places include branch sites, method invocation sites, and one other place ... which I'll explain later.

Hence, we can't actually wait till we reach the 4K limit before dumping the constants. Note that there's also a chance that we may have collected a large number of constants, and when we dump them, each constant further increases the offset to the next one. Also, if we don't dump right now, we don't know when the next opportunity to dump will show up. If it shows up too late, then we'll have a JIT compilation failure. To address this, the CVM JIT uses a heuristic and dumps whenever we reach a distance of 2K from the original load instruction, i.e. we trade off some cache efficiency to make sure that we can reach the constants from the load instructions.
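In sketch form, the heuristic boils down to comparing the distance from the oldest unpatched load against half the maximum reach (hypothetical names; the real check is CVMJITcpoolNeedDump() mentioned above):

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified sketch of the dump heuristic. Assumes the PC of the
       oldest still-unpatched constant load is tracked as constants are
       collected; dump once we are halfway to the maximum reach. */
    #define ARM_MAX_LOAD_RANGE 4096
    #define DUMP_THRESHOLD     (ARM_MAX_LOAD_RANGE / 2) /* the 2K heuristic */

    static bool cpoolNeedDump(intptr_t oldestLoadPC, intptr_t currentPC)
    {
        return (currentPC - oldestLoadPC) >= DUMP_THRESHOLD;
    }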

By now, you might be thinking ... what a naive JIT compiler! Surely it can do something more intelligent and eke out every last bit of offset possible between the load instruction and the constant pool dump. Well, theoretically, the JIT can do that. But in this case, we're talking about a JavaME VM JIT, and it needs to be fast and efficient, i.e. the CVM JIT is not allowed to take up too much time and memory to do the compilation. The above heuristic is a cheap but effective trick that gets the job done without sacrificing too much performance. "Cheap" is good for embedded devices. :)

More CVM JIT Details

Well, actually ... the max offset range is even smaller than 4K if you are using the ARM's VFP hardware floating point instructions. The VFP load instructions (for floating point constants) have an even smaller range ... something like 256 bytes, if I remember correctly. So, the cache efficiency issue is exacerbated there. Anyway, the JIT's gotta do what the JIT's gotta do.

So, remember earlier when I said that there's one other place where we can dump constants? Well, that place is between the sequences of instructions that the JIT grammar rules emit (with some proper code to branch around the constant pool dump, of course). There is a logical break between the sequences of instructions emitted for each JIT grammar rule where we can insert a constant pool dump. Because of the small offset range of ARM VFP constants, the CVM JIT is forced to allow dumps this frequently.
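As a sketch of such a dump site (building on the ConstantRef and patchLoad sketches above; the Emitter type and its helpers are hypothetical, not the actual CVM API): emit a branch over the data, dump each constant, patch its load, then resolve the branch to land just past the pool.

    /* Hypothetical emitter interface, for illustration only. */
    typedef struct Emitter Emitter;
    uint32_t *emitBranchPlaceholder(Emitter *e); /* emit B, target TBD */
    uint32_t *emitWord(Emitter *e, uint32_t w);  /* emit one data word */
    void      patchBranch(uint32_t *b, intptr_t target);
    intptr_t  currentPC(Emitter *e);

    static void dumpConstantPool(Emitter *e, ConstantRef *list)
    {
        uint32_t *branch = emitBranchPlaceholder(e); /* B over the data   */
        ConstantRef *c;
        for (c = list; c != NULL; c = c->next) {
            uint32_t *slot = emitWord(e, c->value);  /* dump the constant */
            patchLoad(c->loadInstr, slot);           /* fix up its LDR    */
        }
        patchBranch(branch, currentPC(e));           /* land past the pool */
    }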

This doesn't necessarily mean that there will always be a constant pool dump every 128 bytes of instructions or so. It only means that when there are constants to dump, you may see them show up every 128 bytes or so in the worst case. Fortunately, our benchmark data shows that performance is not impacted by this (or at least not significantly enough to be noticed).

But I am still digressing ...

The Question again

Jamsheed was asking why the CVM JIT dumps its constants from the last one accessed to the first. This obviously increases the offset distance between the first accessed constant and its corresponding load instruction.

The Answer ... finally

Well, it's simply because we were using a linked list to track the constants, inserting new constants at the head instead of the tail. There's no deeper reason for dumping the constants this way. With very minimal work, we could change it so that the constants are dumped in forward order rather than in the current reverse access order.
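A minimal sketch of the alternative (hypothetical names again, reusing the ConstantRef sketch from earlier): keep a tail pointer so constants are appended in first-accessed order at the same O(1) cost as head insertion.

    /* Append at the tail so the dump order matches the access order. */
    typedef struct {
        ConstantRef *head;
        ConstantRef *tail;
    } ConstantPool;

    static void appendConstant(ConstantPool *pool, ConstantRef *c)
    {
        c->next = NULL;
        if (pool->tail != NULL)
            pool->tail->next = c;
        else
            pool->head = c;
        pool->tail = c;
    }

With the list kept in access order, the dump emits the first-accessed constant first, closest to its load instruction.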

Having said that, if you look at the range limit issues and the heuristic the CVM JIT has to employ to predict the last possible opportunity to dump constants, you may find that this optimization has very little effect in the overall scheme of things. Yes, dumping in forward order will help. How it will help is perhaps by allowing the CVM JIT to use a heuristic ratio that is less than half the max offset range (currently it is half). This will allow more constants to be pooled before we do a dump.

However, I'm not sure how much it will help, nor how much to change the heuristic ratio by. That will be an interesting exercise for when someone can find the time. As I said at the start of this entry, the day job is not leaving us a lot of time to play. :( Is anyone in the community willing to give this a try and report your findings? Of course, I can provide a few tips on what to do if you are interested.

Last Words

So, Jamsheed, I hope that answers your question. Thanks for your astute observation. It gave me this opportunity to share a bit about how constant pool dumps work in the CVM JIT.

Have a nice day. :)


Tags: CVM, Java, J2ME, JavaME, JIT, phoneME, phoneME Advanced, embedded systems


Comments

Hi Mark, thanks a lot for giving a better picture of the problem. Actually, I didn't take cache locality into account at all; now it seems more thought is required to continue. I am currently carrying out an analysis of this, but on a different platform (a platform with a 2-byte instruction size). Ready to contribute as and when I complete my study. Thanks once again, Jamsheed

Dear Mark, I hope you won't read the line "actually i didn't consider cache locality in to account at all" as saying that I didn't know cache locality was the reason for the constant pool design. What I am actually trying to say is that when I thought of improving the heuristic ratio to maximize pooling, I forgot to bring cache locality into the picture. Also, I optimistically think the reverse constant pool design and the heuristic ratio are a careful design by the Sun engineers, taking cache locality into consideration.

I too have a similar doubt. Can you please clear it up?

By fast lock contention, I meant heap lock contention (i.e. contention at fast allocation time).

Hi Mark, I couldn't complete my experiments with the forward constant pool as I was shifted to the GC support team in the project. Now some queries on GC. In CDC, we invoke garbage collection for the fast lock contention case (from my understanding, this is done to roll the GC-unsafe object-allocating thread forward to a GC-safe point). My question is: why should we invoke a GC call to reach a safe point, when this could be achieved by simply making the heap try-lock a blocking lock in the GC-safe window (with slight modification to the GC-safe window), or by polling the heap try-lock with a safe point after every iteration? Please reply.

Thanks mlam. Some doubts about the line "How it will help is perhaps to allow the CVM JIT to use a heuristic ratio that is less than half the max offset range (currently it is half)". I hope you meant "... a heuristic ratio that is greater than half the max offset range (currently it is half)"? From your lines, it's clear that a heuristic greater than half plus a forward constant pool implementation will give much better pooling. Then why will the "optimization have very little effect on the overall scheme of things"?

Jamsheed,

Whether you increase or decrease the "half" is relative to which half you are talking about. But I think we're both in agreement that we can allow more distance between the first accessed constant and its corresponding load instruction.

As for how much effect changing the constants to be dumped in a forward order has, it depends. If we stick with the existing heuristics, then I would guess that there will be no effect. If someone is willing to do some studies with changes in the heuristics, then maybe some gains can be realized. However, the gains come only from improving the probability of having better cache locality.

How much better is that over the existing arrangement? I don't have that answer. My guess (from experience) is that this will yield limited gains in most cases. But there will be some benchmarks / applications that may be more sensitive to cache locality. You might see a bit more gain in those. Would you be willing to do the study and contribute the results?

-- regards, Mark