CVM JIT Constant Pool Dumps
Posted by mlam on March 12, 2008 at 11:52 PM PDT
Hello World! It's been a long time ... ummm ... like 6 months since I last wrote an entry. What can I say? That's the problem with having a day job, and so far, all the ideas for things that I want to write about involve some heavy-duty writing that will take up a lot of time. So, I've been putting it off. Sorry. However, this inquiry came in today on one of my previous blog entries. Now, this, I can answer without taking up a few days of writing time. So, here you are ...

The Question
I took some liberty with editing the comment for clarity. Jamsheed, I hope you don't mind.

What's a Constant Pool?

Instead, Jamsheed is referring to constants that are referenced by the code that the CVM JIT compiler generates. These constants may be the same values that are fetched from the classfile constant pool, but that's beside the point. Now, there will be more than one of these constants used in the JIT generated code. So, instead of spreading them all over the generated code, we "pool" them together, i.e. we keep them in one (or a few) places. We call these places (where we pool the constants) the constant pools. This is the constant pool that Jamsheed was referring to.

Why Pool Constants?

Now consider what happens when you have your constants spread out and interlaced between all the JIT generated code. If that occurs, when you execute code that is located adjacent to a constant, the constant may also be loaded into the i-cache simply because it is located near the needed code. The cache manager doesn't actually know if the bits in memory are code or data. It just loads a few words of memory at a time into the respective cache. If those few words include a constant in the midst of code being executed, the constant will get loaded into the i-cache as well. When this happens, some space in the i-cache will be taken up by something (i.e. the constant) that will never be executed as code. This is an inefficient use of the limited and precious i-cache space.

Similarly, when loading the constant as data, the cache manager will load a few words of memory around the constant into the d-cache. If all the words around the constant are actually code and not data, then the d-cache will now contain wasted space for words that will never be used as data. The end result of all this is less cache locality, and that means that the code will run slower.
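The cache argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: the 8-word cache line, the word-level layout strings, and the helper name `mixed_lines` are all hypothetical and are not taken from CVM or from any particular ARM core. It counts how many cache lines hold a mix of code and constants, which are exactly the lines that waste space in whichever cache they get loaded into.

```python
# Toy model: each character is one 4-byte word of JIT output.
# 'C' = an instruction word, 'D' = a constant (data) word.
LINE_WORDS = 8  # hypothetical cache line size: 8 words (32 bytes)


def mixed_lines(layout, line_words=LINE_WORDS):
    """Count cache lines that contain both code and data words."""
    mixed = 0
    for start in range(0, len(layout), line_words):
        line = layout[start:start + line_words]
        if "C" in line and "D" in line:
            mixed += 1
    return mixed


# The same 56 instructions and 8 constants, laid out two ways.
interleaved = ("C" * 7 + "D") * 8   # one constant after every 7 instructions
pooled = "C" * 56 + "D" * 8         # all constants pooled after the code

print(mixed_lines(interleaved))  # every line mixes code and data
print(mixed_lines(pooled))       # no line mixes code and data
```

In this contrived layout the pool happens to start on a line boundary, so pooling removes the mixing completely. In general a pool still shares its first and last lines with code, so pooling reduces the waste rather than eliminating it.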
By pooling the constants together, we lessen the probability of these kinds of i-cache and d-cache inefficiencies occurring. :) But I digress. Now, let's get back to Jamsheed's question ...

The ARM Instruction Set

Let's say you (i.e. the JIT) just emitted a load instruction to load a constant. Because you want to pool the constants together, you don't actually emit the constant yet. Instead, you keep a record of where the load instruction was emitted, and later on when you emit the constant (and therefore, finally know where it is located), you'll come back and fix up this load instruction with the proper offset for that constant.

The question is ... how do you know when to actually emit (also commonly referred to as "dump") the constant? If you dump it too early, then you may not be pooling as many constants as you possibly could, thereby worsening the cache inefficiency issue I described earlier. If you dump it too late, then the constant may be out of reach of the 4K range of the load instruction that needs to reach it.

The answer is to do periodic checks for a need to dump constants, i.e. we'll dump them out into a pool whenever we feel that we may reach the 4K range limit soon. See CVMJITcpoolNeedDump() in src/share/javavm/runtime/jit/jitconstantpool.c.

Does the CVM JIT really do that?

Hence, we can't actually wait till we reach the 4K limit before dumping the constants. Note that there's also a chance that we may have collected a large number of constants. When we dump the constants, each constant also further increases the offset for the next constant. Also, if we don't dump right now, we don't know when the next opportunity to dump will show up. If it shows up too late, then we'll have a JIT compilation failure. To address this issue, the CVM JIT uses a heuristic and dumps whenever we reach a distance of 2K from the original load instruction, i.e. we trade off some cache inefficiency to make sure that we can reach the constants from the load instructions.

By now, you might be thinking ... what a dumb JIT compiler! Surely it can do something more intelligent and eke out every last bit of offset possible between the load instruction and the constant pool dump. Well, theoretically, the JIT can do that. But in this case, we're talking about a JavaME VM JIT, and it needs to be fast and efficient, i.e. the CVM JIT is not allowed to take up too much time and memory to do the compilation. Using the above heuristic is a cheap but effective trick that gets the job done without sacrificing too much performance. "Cheap" is good for embedded devices. :)

More CVM JIT Details

So, remember earlier when I said that there's one other place where we can dump constants? Well, that place is in between every sequence of instructions that the JIT grammar rules may emit (with some proper code to branch around the constant pool dump, of course). There is a logical break between each sequence of instructions emitted for each JIT grammar rule where we can insert a constant pool dump. Because of the small offset range of ARM VFP constants, the CVM JIT is forced to allow dumps more frequently like this. This doesn't necessarily mean that there will always be a constant pool dump every 128 bytes of instructions or so. It only means that when there are constants to dump, you may see them show up every 128 bytes or so in the worst case. Fortunately, our benchmark data shows that performance is not impacted by this (or at least not significantly enough to be noticed). But I am still digressing ...

The Question again

The Answer ... finally

Having said that, if you take a look at the range limit issues and the heuristic that the CVM JIT has to employ to predict when the last possible opportunity to dump constants is, you may find that this optimization will have very little effect on the overall scheme of things.
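To make the range limit and the half-range heuristic concrete, here is a minimal Python sketch of the record, check, dump, and fix-up cycle described earlier. It is a toy model and not the CVM source: `ToyEmitter`, `need_dump`, and `dump_pool` are hypothetical names, every instruction and constant is assumed to be 4 bytes, the exact ARM offset encoding is ignored, and constants are placed in the order their loads were recorded, which may differ from what CVM actually does.

```python
MAX_LOAD_OFFSET = 4096                  # 4K reach of the load instruction
DUMP_THRESHOLD = MAX_LOAD_OFFSET // 2   # the "half" heuristic: dump at 2K
INSTR_SIZE = 4                          # assumed size of one instruction
CONST_SIZE = 4                          # assumed size of one pooled constant


class ToyEmitter:
    """Hypothetical emitter that tracks addresses only, not real code."""

    def __init__(self):
        self.pc = 0
        self.pending = []   # (load_address, value) pairs awaiting a pool slot
        self.fixups = []    # (load_address, constant_address, value)

    def emit_instruction(self):
        self.pc += INSTR_SIZE

    def emit_constant_load(self, value):
        # Emit the load now and remember where it is, so that its offset
        # can be patched once the constant's address is known.
        self.pending.append((self.pc, value))
        self.pc += INSTR_SIZE

    def need_dump(self):
        # Dump once the oldest unresolved load is 2K or more behind us.
        return bool(self.pending) and self.pc - self.pending[0][0] >= DUMP_THRESHOLD

    def dump_pool(self):
        self.pc += INSTR_SIZE  # branch around the pool
        for load_addr, value in self.pending:
            if self.pc - load_addr >= MAX_LOAD_OFFSET:
                raise RuntimeError("constant out of reach of its load")
            self.fixups.append((load_addr, self.pc, value))
            self.pc += CONST_SIZE  # each constant pushes the next one further away
        self.pending.clear()


emitter = ToyEmitter()
emitter.emit_constant_load(0x12345678)
for _ in range(2000):
    emitter.emit_instruction()
    if emitter.need_dump():
        emitter.dump_pool()
```

In this run the single constant ends up a little over 2K from its load, well inside the 4K reach. The unused half of the range is the slack that covers a large pending pool and the uncertainty about when the next chance to dump will come.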
Yes, dumping in forward order will help. How it will help is perhaps to allow the CVM JIT to use a heuristic ratio that is less than half the max offset range (currently it is half). This will allow more constants to be pooled before we do a dump. However, I'm not sure how much it will help and how much to change the heuristic ratio. That will be an interesting exercise to do when someone can find the time. As I said when I started this entry, the day job is not leaving us a lot of time to play. :( Is anyone in the community willing to give this a try and report your findings? Of course, I can provide a few tips on what to do if you are interested.

Last Words

Have a nice day. :)

Tags: CDC CVM Java J2ME JavaME JIT phoneME phoneME Advanced embedded systems
Related Topics: Virtual Machine

Comments
Comments are listed in date ascending order (oldest first)
Submitted by mlam on Fri, 2008-03-14 12:38.
Jamsheed,
Whether you increase or decrease the "half" is relative to which half you are talking about. But I think we're both in agreement that we can allow more distance between the first accessed constant and its corresponding load instruction.
As for how much effect changing the constants to be dumped in a forward order has, it depends. If we stick with the existing heuristics, then I would guess that there will be no effect. If someone is willing to do some studies with changes in the heuristics, then maybe some gains can be realized. However, the gains come only from improving the probability of having better cache locality.
How much better is that over the existing arrangement? I don't have that answer. My guess (from experience) is that this will yield limited gains in most cases. But there will be some benchmarks / applications that may be more sensitive to cache locality. You might see a bit more gain in those. Would you be willing to do the study and contribute the results?
-- regards, Mark
Submitted by jamsheed_mohammed on Thu, 2008-03-13 23:10.
Thanks mlam,
I have some doubts about the line quoted below:
"How it will help is perhaps to allow the CVM JIT to use a heuristic ratio that is less than half the max offset range (currently it is half)". I hope you meant "How it will help is perhaps to allow the CVM JIT to use a heuristic ratio that is greater than half the max offset range (currently it is half)"?
From your lines it's clear that a heuristic greater than half, together with a forward constant pool implementation, will give much better pooling. Then why do you still say that this "optimization will have very little effect on the overall scheme of things"?
Submitted by jamsheed_mohammed on Thu, 2008-05-08 04:04.
Hi Mark,
I couldn't complete my experiments with the forward constant pool, as I was shifted to the GC support team in my project.
Now I have some queries about GC.
In CDC, garbage collection is invoked in the fast lock contention case. (From my understanding, this is done to roll the GC-unsafe thread that is allocating an object to a GC-safe state.) My question is: why should we invoke a GC call to reach a safe point when this can be achieved by simply making the try-heap-lock a blocking lock inside a GC-safe window (with a slight modification to the GC-safe window), or by polling the try-heap-lock with a safe point after every iteration?
Please reply.
Submitted by jamsheed_mohammed on Thu, 2008-05-08 05:04.
By fast lock contention, I meant heap lock contention (i.e. contention at fast allocation time).
Submitted by hkpottyn on Tue, 2008-05-13 02:59.
I too have a similar doubt.
Can you please clear my doubt?
Submitted by jamsheed_mohammed on Wed, 2008-03-19 23:32.
Dear Mark,
I hope my line "actually I didn't take cache locality into account at all" does not give the impression that I didn't know cache locality was the reason for the design of the constant pool. What I am actually trying to say is that when I thought of improving the heuristic ratio to maximize pooling, I forgot to take cache locality into the picture.
Also, I optimistically think that the reverse constant pool design and the heuristic ratio are a careful design by the Sun engineers that takes cache locality into consideration.
Submitted by jamsheed_mohammed on Wed, 2008-03-19 23:06.
Hi Mark,
Thanks a lot for giving a better picture of the problem. Actually I didn't take cache locality into account at all; now it seems more thought is required to continue....
I am currently carrying out an analysis of this, but on a different platform (a platform with a 2-byte instruction size). I am ready to contribute as and when I complete my study.
Thanks once again
Jamsheed