A Tale of Two Stacks
There's a term that I commonly use to describe CVM: embedded friendliness. Is this really a proper English term? Maybe, or maybe not. But it concisely expresses the idea that a system is suitably designed to work on embedded systems, and thereby has the property of embedded friendliness. And CVM is embedded friendly. If you are a purist of the English language, I apologize. But for the sake of conciseness, I'll continue using this term.
So today, I'll continue talking about the runtime stacks in the phoneME Advanced VM / CVM from my last entry.
Let's get started ...
Why use 2 Stacks?
Obviously, we cannot do without a native stack. The native thread depends on this stack. So, the question can be rephrased as: why use a separate Java stack instead of just pushing Java frames on the native stack as JavaSE does? I think the reason is that CVM was designed to be as friendly as possible to embedded OSes as well.
In the early days of CVM, Linux was still uncommon in embedded systems. Most embedded OSes in those days (and still today) do not have processes, and do not provide access to a virtual memory manager (even if the hardware supports it). Hence, when you create a native thread, you have to malloc a chunk of memory for use as the native stack. Remember that CVM's threads are mapped directly onto native threads. One issue with having to allocate memory for the stacks is knowing exactly how much memory to allocate. If you allocate too little, applications won't run. Allocate too much, and you'll have wasted resources. For embedded devices, wasting resources is highly undesirable. Note that without virtual memory, malloc'ing the memory here means a fixed-size allocation, and the memory is committed upon allocation. In turn, that means no other thread can use the memory even if the current thread doesn't need it. This is why over-allocating stack sizes results in wastage on most embedded OSes.
For classical embedded software written in C, one could do some analysis to determine max usage of stack space and get an optimum allocation of memory. For a Java platform, this is typically not possible. One of the most desirable features of the Java platform is its ability to enhance the device by allowing new or updated versions of applications to be downloaded and executed without having to re-ROM/flash the core firmware in the device. This means that the stack requirement of the application(s) is not known / computable at the time the Java VM is built.
And the solution is ...
The Separate Java stack
As previously mentioned, CVM's Java stack is chunky. This means that memory can be malloc'ed as needed. If the application needs more stack memory than the minimum, then the VM can allocate more stack chunks to fill this need. If the application does not need the extra memory, the thread operates with the minimum Java stack size.
The stack chunk size, and the min and max Java stack sizes, are specified in globals.c. Search for stackChunkSize, stackMinSize, and stackMaxSize respectively in the file. All three appear there as part of an example list of VM command-line options; accordingly, they can be changed on the VM command line using -Xopt:stackChunkSize=&lt;size&gt;, -Xopt:stackMinSize=&lt;size&gt;, and -Xopt:stackMaxSize=&lt;size&gt; respectively.
Note that the stack chunk size specification is only used to determine the typical chunk size, and not as a limit on chunk sizes. In the event that there is a Java method which requires an abnormally large frame that doesn't fit in the size of a typical chunk (e.g. has a lot of locals, and/or needs a very deep operand stack), the VM will allocate a special chunk just for that method frame and insert it into the stack where it is needed. The old chunk which isn't big enough will be freed.
The min size will be used as the size of the first stack chunk. The max size is the upper bound on how much memory can be allocated for each Java thread.
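The chunk-growing behavior described above can be sketched in C. This is a minimal illustration, not CVM's actual code: the StackChunk and JavaStack structures and stack_ensure_capacity are hypothetical names, and a real implementation would also carry the frame data itself (and, as noted earlier, would free an undersized chunk when replacing it with an oversized one):

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of a chunked Java stack; names are illustrative,
 * not CVM's actual API. */

typedef struct StackChunk {
    struct StackChunk *prev;   /* link back to the previous chunk */
    size_t capacity;           /* usable slots in this chunk */
    size_t used;               /* slots currently occupied by frames */
    /* slot storage would follow in a real implementation */
} StackChunk;

typedef struct {
    StackChunk *top;      /* chunk holding the current frame */
    size_t chunk_size;    /* typical chunk size (cf. stackChunkSize) */
    size_t total;         /* total slots allocated so far */
    size_t max;           /* hard cap per thread (cf. stackMaxSize) */
} JavaStack;

static StackChunk *chunk_new(size_t capacity, StackChunk *prev) {
    StackChunk *c = malloc(sizeof *c);
    if (c == NULL) return NULL;
    c->prev = prev;
    c->capacity = capacity;
    c->used = 0;
    return c;
}

/* Make sure the stack can hold a frame of frame_size slots.  If the
 * current chunk is too full, grab a new one; an abnormally large frame
 * gets a dedicated chunk sized to fit it, mirroring CVM's behavior. */
int stack_ensure_capacity(JavaStack *s, size_t frame_size) {
    if (s->top != NULL && s->top->capacity - s->top->used >= frame_size)
        return 1;                         /* fits in the current chunk */
    size_t cap = frame_size > s->chunk_size ? frame_size : s->chunk_size;
    if (s->total + cap > s->max)
        return 0;                         /* would exceed stackMaxSize */
    StackChunk *c = chunk_new(cap, s->top);
    if (c == NULL) return 0;
    s->top = c;
    s->total += cap;
    return 1;
}
```

The key property is that memory is committed one chunk at a time, so a thread that never grows past the minimum never pays for more.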
What happens to the Native Stack?
Since Java frames will all be located on the Java stack which is growable, the native stack can be smaller, thereby minimizing memory wastage due to inaccurate size estimates.
The Interpreter's Native Frames
But how can this be? Java methods still need to be interpreted. Wouldn't we need a native stack frame for the interpreter function? Yes, we do. However, we do not need an interpreter frame for every Java method. We only enter the interpreter frame once in the native stack, and operate within that one frame while the interpreter executes Java bytecodes. The interpreter function will push and pop frames on the Java stack for the Java methods, but does not itself recurse on the native stack.
For example, when method mIa calls mIb, which in turn calls mIc (where all 3 methods are interpreted bytecode methods), the frames on the stacks will look as follows:
native stack: ... executeJava
Java stack: ... mIa -> mIb -> mIc
Note: executeJava is the abbreviation of the name of the interpreter loop function (see CVMgcUnsafeExecuteJavaMethod() in executejava_standard.c). Note also that the executeJava frame only appears once on the native stack even though there are 3 Java methods on the Java stack.
This is part of the C recursion elimination feature that is designed into CVM to make it embedded friendly. Eliminating the need for C recursion in the interpreter now enables us to shrink the usage of the C stack significantly. Another example of C recursion elimination in CVM is in the classloading system where the highly recursive part of the classloading process is moved into Java code to make use of the Java stack instead.
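Here is a toy sketch of the idea, with made-up names and a made-up three-opcode "bytecode" (this is not CVM's interpreter): method invocation pushes a frame onto the interpreter's own frame array and the loop simply continues, so the C function never calls itself:

```c
#include <stddef.h>

/* Illustrative sketch of C recursion elimination; Frame, interpret,
 * and the OP_* opcodes are invented for this example. */

enum { OP_INVOKE, OP_RETURN, OP_NOP };

typedef struct {
    const unsigned char *code;  /* bytecode being executed */
    size_t pc;                  /* program counter in that code */
} Frame;

/* "Execute" the given bytecode and report the maximum Java-frame depth
 * reached.  An invoke pushes a Frame and the loop keeps going: one C
 * stack frame serves arbitrarily deep Java call chains. */
size_t interpret(const unsigned char *code) {
    Frame frames[64];           /* stands in for the chunked Java stack */
    size_t depth = 1, max_depth = 1;
    frames[0].code = code;
    frames[0].pc = 0;
    for (;;) {
        Frame *f = &frames[depth - 1];
        unsigned char op = f->code[f->pc++];
        switch (op) {
        case OP_INVOKE:
            /* push a callee frame; in this toy language, the callee's
             * code starts right after the invoke opcode */
            frames[depth].code = f->code;
            frames[depth].pc = f->pc;
            depth++;
            if (depth > max_depth) max_depth = depth;
            break;
        case OP_RETURN:
            if (--depth == 0)
                return max_depth;   /* outermost method returned */
            /* resume the caller just past the callee's body */
            frames[depth - 1].pc = f->pc;
            break;
        default:
            break;                  /* OP_NOP: nothing to do */
        }
    }
}
```

Three nested invokes produce three Java frames but still only the one interpret() frame on the C stack, which is exactly the mIa -> mIb -> mIc picture above.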
The Native Code's Native Frames
What about native methods? Each native method needs a native frame because it is C (or C++) code. Well, the native method has both a native frame and a Java frame. Its Java frame is used to store object pointers (i.e. GC roots) and to track the native method in the Java call chain. The native frame is used for everything else.
When an interpreted method (mI) calls a native method (mN), the stacks will look as follows:
native stack: ... executeJava -> mN
Java stack: ... mI -> mN
But native methods can invoke another native or interpreted method through JNI. Here's what the stack looks like for native to native invocation:
native stack: ... executeJava -> mNa -> executeJava -> mNb
Java stack: ... mNa -> mNb
Here's what the stack looks like for native (mN) to interpreted (mI) invocation:
native stack: ... executeJava -> mN -> executeJava
Java stack: ... mN -> mI
Here's native (mNa) to interpreted (mIa) to native (mNb) to interpreted (mIb) invocation:
native stack: ... executeJava -> mNa -> executeJava -> mNb -> executeJava
Java stack: ... mNa -> mIa -> mNb -> mIb
Note that the interpreter loop must appear on the native stack once for each native method. This is because the interpreter loop is used to invoke a transition method which bootstraps the native method. See yesterday's discussion for details on transition frames and methods.
Therefore, invoking any method from a native method (whether native or interpreted) requires recursing into the interpreter loop, i.e. executeJava. Hence, it's not 100% accurate to say that CVM never recurses into the interpreter loop. Rather, CVM eliminates a lot of interpreter loop recursion, i.e. the recursion for bytecode methods. CVM will recurse into the interpreter loop once for each layer of native methods in the call chain.
This is one reason why native methods aren't necessarily good for performance as mentioned previously. Recursion into the interpreter loop adds some overhead. However, this isn't the only reason native code can be bad for performance. More on that another day.
Why recurse the Interpreter Loop?
When a native method invokes another Java method (which may itself be implemented as native), the VM needs to check whether the callee method is bytecode or native (or something else: abstract, not implemented, etc.). All of this logic is already in the interpreter loop function. So, rather than duplicating the functionality, CVM uses the transition frame bootstrapping to get the interpreter loop to do the checking work for us. Also, if the callee method is bytecode, then we need to recurse into the interpreter loop anyway. So, there is some advantage to just letting the interpreter loop do the method dispatch check.
Calling a C function that is not a Java method, e.g. malloc(), does not require recursing into the interpreter loop.
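A minimal sketch of that dispatch decision might look like the following; the method kinds and result codes are illustrative stand-ins, not CVM's real structures:

```c
/* Hypothetical sketch of the invoke dispatch check described above. */

enum { METHOD_BYTECODE, METHOD_NATIVE, METHOD_ABSTRACT };

typedef struct {
    int kind;   /* METHOD_BYTECODE, METHOD_NATIVE, or METHOD_ABSTRACT */
} Method;

enum {
    DISPATCH_PUSH_JAVA_FRAME,   /* interpreted: stay in this loop */
    DISPATCH_RECURSE_NATIVE,    /* native: call out, recursing into C */
    DISPATCH_THROW_ERROR        /* abstract/unimplemented: error */
};

/* One dispatch check, shared by every invoke site.  Because this logic
 * lives in the interpreter loop, a native caller reuses it by
 * bootstrapping a fresh loop instance via a transition frame instead of
 * duplicating the check in the native-call glue. */
int dispatch(const Method *callee) {
    switch (callee->kind) {
    case METHOD_BYTECODE:
        return DISPATCH_PUSH_JAVA_FRAME;
    case METHOD_NATIVE:
        return DISPATCH_RECURSE_NATIVE;
    default:
        return DISPATCH_THROW_ERROR;
    }
}
```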
Detecting Stack Overflows
Since recursion is possible on both the native and Java stacks, we need a way to ensure that we do not exceed the bounds of either stack. If code execution in a thread threatens to do that, a StackOverflowError must be thrown. For the Java stack, we always check the stack bounds before pushing a frame. We need to do this anyway to ensure that we do not overflow the stack chunks. Note that we are able to do this because the max size of a Java stack frame is defined in the class' method data, and since we are interpreting the bytecodes, we can check for potential overflow of the Java stack on every method invocation.
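A sketch of that pre-push check, with hypothetical structures standing in for the class' method data: because max_locals and the max operand stack depth are fixed in the class file, a single comparison per invocation suffices:

```c
#include <stddef.h>

/* Illustrative sketch; the structure and function names are invented,
 * not CVM's actual ones. */

typedef struct {
    size_t max_locals;   /* local variable slots, from the class file */
    size_t max_stack;    /* operand stack depth, from the class file */
} MethodInfo;

typedef struct {
    size_t used;         /* slots already occupied on this Java stack */
    size_t limit;        /* total slots this thread may use */
} JavaStackBounds;

/* Returns 1 and commits the space if the frame fits; returns 0 when
 * the invoke must instead fail with a StackOverflowError. */
int push_frame(JavaStackBounds *s, const MethodInfo *m) {
    size_t need = m->max_locals + m->max_stack;
    if (s->limit - s->used < need)
        return 0;        /* would overflow the Java stack */
    s->used += need;
    return 1;
}
```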
For native functions, it would be painful to have to implement a stack bounds check for every function called. Even if we could do that for the VM code, we wouldn't be able to enforce it on any native JNI methods that a third-party middleware vendor may write. JNI neither requires nor provides any mechanism for this. Another issue is that the performance cost would be high.
So, for native code, we use a red zone mechanism. Though the Java application code is unpredictable at VM build time, the VM code itself is not. Hence, we can apply some static analysis to determine the max stack usage of native code (in most cases). But instead of inserting a stack check before every function call, we only insert stack checks at major recursion points. For example, see the call to CVMCstackCheckSize() (in stacks.c) from the interpreter loop, CVMgcUnsafeExecuteJavaMethod() (in executejava_standard.c).
Instead of ensuring enough stack capacity for one function call, each checkpoint ensures enough capacity for any possible call chain from that checkpoint until another checkpoint is encountered (including when the same function recurses). This checked capacity is called a red zone. An example of a red zone value is CVM_REDZONE_ILOOP, which is the stack capacity checked at the checkpoint in the interpreter loop. All red zone values are defined in the platform-specific threads_md.h (for example, see the Linux one here). The defined values are conservative estimates that should work for any CPU architecture.
If you look at the example source, you will see that all the red zone values are padded with a CVM_REDZONE_Lib_Overhead value. This value is the estimated stack usage of any given JNI method and its possible call chains (up to another red zone checkpoint, of course). The only case where we can't compute native stack usage is JNI code that is provided by a third party. Hence, we use a reasonably conservative estimate for this padding. A VM porting engineer should adjust this value if it is not conservative enough for the actual target device. However, it is always possible to write a JNI method that is pathological and will break any limit we set. But then again, when you have untrusted native methods, you'll have a lot more to worry about than unreasonable stack usage. (Note: this is yet another reason to avoid using native code.) Hence, the red zone padding here is meant only to protect against overflow due to reasonable stack usage. It is not a guarantee against misbehaving native code. So, don't overdo the padding.
Note that since every native method is bootstrapped by an instance of the interpreter loop, adding the padding to the interpreter loop red zone value should cover the JNI method's native stack usage. But, as a cautionary (possibly paranoid) measure, the padding was added to all the other red zone values as well. Note that the red zone values are only used for checking the capacity of potential stack usage. A red zone check only ensures that we have at least as much memory left in the native stack as specified by the red zone value before embarking on a native call chain. Hence, any wastage in stack memory (due to conservative guessing of red zone values) is bounded by the size of the max red zone value. The amount of wastage is not cumulative across red zone checkpoints.
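The red zone checkpoint can be sketched as follows. The constants and structures here are illustrative stand-ins (the real values live in the platform-specific threads_md.h and differ per port), and the sketch assumes a downward-growing native stack:

```c
#include <stddef.h>

/* Illustrative red-zone checkpoint; the REDZONE_* values and the
 * NativeStackInfo layout are made up for this sketch. */

#define REDZONE_ILOOP        8192  /* worst case until the next checkpoint */
#define REDZONE_LIB_OVERHEAD 4096  /* padding for third-party JNI code */

typedef struct {
    char *stack_bottom;  /* lowest legal address of the native stack */
    char *stack_ptr;     /* current top-of-stack (grows downward) */
} NativeStackInfo;

/* Checkpoint: instead of checking before every C call, verify once
 * that the whole call chain up to the next checkpoint has room,
 * including the padding for unknown JNI code. */
int redzone_check(const NativeStackInfo *t, size_t redzone) {
    return (size_t)(t->stack_ptr - t->stack_bottom)
               >= redzone + REDZONE_LIB_OVERHEAD;
}
```

Because the check is a single comparison performed only at recursion points, the runtime cost is negligible compared with a per-call bounds check.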
What about compiled method frames? Also, are there efficiency / performance issues with using 2 stacks instead of 1? I'll leave these for the next article.
So, How do other VMs "stack" up?
The JavaSE VM uses a single native stack per thread. Java methods have frames residing on the native stack as well. JavaSE is usually targeted at environments that support virtual memory. Therefore, stack memory can be reserved but not committed until needed. Hence, physical memory is not wasted.
The phoneME Feature / CLDC VM implements its own pseudo-thread library for Java threads. In terms of native threads, there is (theoretically speaking) only one for the whole VM. Instead, each Java thread gets its own Java stack. The Java thread does not have a separate native stack.
Disclaimer: I am not an expert on these other VMs. I'm only telling you what I've heard i.e. they both use one stack per thread.
Time for Bed
You may be reading this during the day, but it's past midnight for me here writing it. So, it's time to stop. The article is long enough as it is anyway. In the next article, I'll talk about the remaining stack issues, and hopefully after that, I'll get into the performance issues of native code.
Have a nice day. :-)