
The BIG Picture: a Map of CVM

Posted by mlam on November 27, 2006 at 1:17 PM PST

Personally, when I dive into a new system, one of the first things that I try to figure out is how everything fits together. If you are a visual thinker like me, one of the best ways to do that is to draw a diagram of all the things that you think are important and see how they relate to one another. In the case of embedded systems, in my experience, it is also important to know what goes where in memory, and to get a feel for how system resources are being used. Hence, I prefer to map out the data structures.

Here is my map of CVM ...

the WORLD according to CVM

Map of CVM Data Structures
Click on the map to get a popup window with a 1024 x 768 res bitmap of the map (if you want to view it in a separate window). Or click here to view the map in a PDF file. I highly recommend using the PDF if you plan to do a printout of the map.

And here's how to read the map ...

the Root Data Structure
One of CVM's design criteria is to be restartable even when you run it on an OS that is not process based. Restartability without processes requires that we are able to release all malloc'ed memory. To make life easier (and it is good practice anyway), we make sure that all data is reachable from the root of a single tree of data structures in memory. This root data structure is CVMglobals, which you will find at the left side of the map. You will find CVMglobals defined in globals.h here (also look for CVMGlobalState in this file) and globals.c here. Looking in CVMglobals, you will find that it is an aggregation of system global data structures. Keeping the globals in one location also makes it easier to restore the globals to a known initial state, i.e. by memsetting the whole thing to 0 (after we have done proper clean up of all the subtrees, of course).
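To make the "single root, then memset" idea concrete, here is a minimal sketch. It is not the real CVMglobals: every name in it (DemoGlobals, DemoGcState, DemoJitState, demoGlobalsInit, demoGlobalsDestroy) is hypothetical, and only two subsystems are modeled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for two subsystems; not the real CVM types. */
typedef struct DemoGcState  { void *heap;      size_t heapSize;      } DemoGcState;
typedef struct DemoJitState { void *codeCache; size_t codeCacheSize; } DemoJitState;

/* Single root: every subsystem's state is embedded here. */
typedef struct DemoGlobals {
    DemoGcState  gc;
    DemoJitState jit;
} DemoGlobals;

DemoGlobals demoGlobals;

/* Free every subtree first, then zero the root back to its initial state. */
void demoGlobalsDestroy(void)
{
    free(demoGlobals.gc.heap);
    free(demoGlobals.jit.codeCache);
    memset(&demoGlobals, 0, sizeof demoGlobals);
}

/* Returns 1 on success, 0 if an allocation failed. */
int demoGlobalsInit(size_t heapSize, size_t codeCacheSize)
{
    demoGlobals.gc.heap = malloc(heapSize);
    demoGlobals.jit.codeCache = malloc(codeCacheSize);
    if (demoGlobals.gc.heap == NULL || demoGlobals.jit.codeCache == NULL) {
        demoGlobalsDestroy();
        return 0;
    }
    demoGlobals.gc.heapSize = heapSize;
    demoGlobals.jit.codeCacheSize = codeCacheSize;
    return 1;
}
```

Because everything malloc'ed hangs off the one root, the destroy function can release it all without relying on process exit, and a later call to the init function starts from a clean slate.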

GC and the Java Heap
From the globals, you can find an embedded struct which holds GC configuration and management information (CVMglobals.gc). From this, you will be able to get to the Java heap eventually.

CVM has a pluggable GC architecture. Pluggable as in build-time pluggable, not runtime pluggable. This allows for experimental GCs to be tried out with CVM. Currently, the only product-quality GC for CVM is the generational GC (see here and here for GC-specific implementation files).

All Java objects, i.e. anything that extends from java.lang.Object, are allocated from the Java heap. The only exception to that is ROMized Java objects, which reside in global data. The Java heap itself is allocated from the C heap. All other data structures are allocated either from global data (i.e. .bss, .data, or their equivalents) or from the C heap.

the JIT and Compiled Code
CVMglobals also holds the configuration and management records for the JIT (CVMglobals.jit). Traversing that tree, you will eventually find the JIT code buffer (also commonly known as the code cache). The code cache is currently fixed-size (though runtime configurable) and is allocated at VM boot time. Once it has been allocated, its size cannot be changed.

When a Java method gets compiled by the JIT, the compiler generated bits (commonly referred to as the compiled method) will reside in the code cache. The compiled method's meta-data (generated by the JIT) will also be stored in the code cache. Hence, the size of the code cache will dictate, indirectly, how many methods can be compiled.
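As a rough illustration of why a fixed-size buffer caps the number of compiled methods, here is a sketch of a bump-pointer code cache. All names (DemoCodeCache, demoCodeCacheInit, demoCodeCacheAlloc) are hypothetical, and the sketch simply fails when the buffer is full; it does not model whatever CVM actually does to manage or reclaim space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size code cache, sized once at "boot". */
typedef struct DemoCodeCache {
    unsigned char *start;
    size_t size;   /* fixed after init */
    size_t used;
} DemoCodeCache;

/* Returns 1 on success, 0 if the buffer could not be allocated. */
int demoCodeCacheInit(DemoCodeCache *cc, size_t size)
{
    cc->start = malloc(size);
    cc->size = (cc->start != NULL) ? size : 0;
    cc->used = 0;
    return cc->start != NULL;
}

/* Reserve room for a compiled method and its metadata together.
   Returns NULL when the remaining space is too small. */
void *demoCodeCacheAlloc(DemoCodeCache *cc, size_t codeBytes, size_t metaBytes)
{
    size_t need = codeBytes + metaBytes;
    if (need < codeBytes || need > cc->size - cc->used) {
        return NULL;   /* overflow in the sum, or cache full */
    }
    void *p = cc->start + cc->used;
    cc->used += need;
    return p;
}
```

Since the metadata shares the buffer with the generated code, both count against the same budget.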

Java Objects and Classes
When a classfile is loaded into memory, its contents are parsed and organized into an optimal structure which is allocated from the C heap. This structure is called the CVMClassBlock, and it holds all the metadata of the class. The metadata includes the constantpool, class attributes, field and method information, bytecodes, etc. For each CVMClassBlock, there is one instance of java.lang.Class which will be allocated from the Java heap. Once a class has been properly loaded, these will always exist as a pair. The classblock will have a reference to the class, and vice versa. When the class is unloaded, both will be freed, effectively together.

Every Java object in CVM will have 2 words of header. The first word usually contains a pointer to the classblock. However, this header is not visible to Java code. It is only visible to the C side of the VM. Note: since java.lang.Class extends java.lang.Object, instances of Class will also have these 2 word headers.
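The relationships above can be sketched in a few lines of C. These are not the definitions from objects.h or classes.h: DemoObjectHeader, DemoClassObject, DemoClassBlock, demoLinkPair and demoClassOf are hypothetical names, and the second header word is left opaque because its use is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoClassBlock DemoClassBlock;

/* Two-word header carried by every object (invisible to Java code). */
typedef struct DemoObjectHeader {
    DemoClassBlock *cb;     /* word 1: usually a pointer to the classblock */
    uintptr_t       word2;  /* word 2: left opaque in this sketch */
} DemoObjectHeader;

/* Stands in for a java.lang.Class instance living in the Java heap. */
typedef struct DemoClassObject {
    DemoObjectHeader hdr;   /* Class extends Object, so it has a header too */
    DemoClassBlock  *cb;    /* reference back to the paired classblock */
} DemoClassObject;

/* Stands in for the C-heap metadata record of a loaded class. */
struct DemoClassBlock {
    const char      *name;
    DemoClassObject *javaInstance;  /* the paired Class instance */
};

/* Tie a classblock and its Class instance together as a pair. */
void demoLinkPair(DemoClassBlock *cb, DemoClassObject *cls)
{
    cb->javaInstance = cls;
    cls->cb = cb;
}

/* From any object header, reach the Class instance via the classblock. */
DemoClassObject *demoClassOf(const DemoObjectHeader *obj)
{
    return obj->cb->javaInstance;
}
```

Getting from an object to its Class instance is two pointer hops in this model: header word 1 to the classblock, then the classblock's reference to the Class instance.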

Key files to look at are objects.h and classes.h. See here for the files.

Java Threads
In order to execute anything, the VM must have threads. Each Java thread is represented by a CVMExecEnv (also commonly referred to as an ee). In the VM, the ee is essentially the token identifier of the thread. All thread operations require the ee of the currently executing thread as a parameter. See interpreter.h here and interpreter.c here.

There is a one-to-one mapping between the ee and the java.lang.Thread instance. Once the thread is properly initialized, the 2 will always exist as a pair.

There is also a one-to-one mapping between the ee and a JNIEnv. The JNIEnv is embedded as a field within the ee. Mapping between the ee and JNIEnv addresses basically requires only an offset adjustment.
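The offset adjustment can be sketched with the standard offsetof macro. The struct layouts and function names below (DemoJNIEnv, DemoExecEnv, demoEE2JNIEnv, demoJNIEnv2EE) are hypothetical and are not taken from interpreter.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a JNIEnv. */
typedef struct DemoJNIEnv {
    const void *functions;
} DemoJNIEnv;

/* Hypothetical stand-in for an ee with the JNIEnv embedded as a field. */
typedef struct DemoExecEnv {
    int        threadID;
    DemoJNIEnv jniEnv;
} DemoExecEnv;

/* ee -> JNIEnv: take the address of the embedded field. */
DemoJNIEnv *demoEE2JNIEnv(DemoExecEnv *ee)
{
    return &ee->jniEnv;
}

/* JNIEnv -> ee: subtract the field's offset within the ee. */
DemoExecEnv *demoJNIEnv2EE(DemoJNIEnv *env)
{
    return (DemoExecEnv *)((char *)env - offsetof(DemoExecEnv, jniEnv));
}
```

Neither direction needs a lookup table or an extra pointer field, which is the point of embedding the one inside the other.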

All ees are chained together in a linked list. The head of this list is CVMglobals.threadList. The ee of the main thread is allocated as an embedded field in CVMglobals. The others are malloc'ed.
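Here is a minimal sketch of such a list, with the main thread's record embedded in the globals and the rest malloc'ed. Apart from the threadList field name, everything here (DemoEE, DemoThreadGlobals, the demoThread* functions) is hypothetical, and the locking that a real VM needs around list updates is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct DemoEE {
    int id;
    struct DemoEE *next;
} DemoEE;

typedef struct DemoThreadGlobals {
    DemoEE  mainEE;      /* main thread's ee: embedded, never malloc'ed */
    DemoEE *threadList;  /* head of the list of all ees */
} DemoThreadGlobals;

void demoThreadsInit(DemoThreadGlobals *g)
{
    g->mainEE.id = 0;
    g->mainEE.next = NULL;
    g->threadList = &g->mainEE;
}

/* Allocate an ee for a new thread and push it on the head of the list. */
DemoEE *demoThreadAttach(DemoThreadGlobals *g, int id)
{
    DemoEE *ee = malloc(sizeof *ee);
    if (ee == NULL) return NULL;
    ee->id = id;
    ee->next = g->threadList;
    g->threadList = ee;
    return ee;
}

size_t demoThreadCount(const DemoThreadGlobals *g)
{
    size_t n = 0;
    for (const DemoEE *ee = g->threadList; ee != NULL; ee = ee->next) n++;
    return n;
}

/* Free every malloc'ed ee; the embedded main ee must not be freed. */
void demoThreadsDestroy(DemoThreadGlobals *g)
{
    DemoEE *ee = g->threadList;
    while (ee != NULL) {
        DemoEE *next = ee->next;
        if (ee != &g->mainEE) free(ee);
        ee = next;
    }
    g->threadList = NULL;
}
```

The one thing to watch in a layout like this is teardown: the walk has to skip the embedded record, since it was never handed out by malloc.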

System Mutexes
Manipulation of the VM thread list needs to be synchronized. The same is true for many other subsystems and resources in the VM. This synchronization is normally done by using a CVMSysMutex (see sync.h here and sync.c here). There are several sysMutexes allocated at VM boot time. These mutexes are visible to and used by VM C code only, never Java code.

Each sysMutex has a dedicated purpose (e.g. the CVMglobals.threadLock is for synchronizing the thread operations), and is ranked. In order to prevent deadlock, sysMutexes can only be locked in increasing rank order. When CVM is built with assertions enabled, this rank order will be asserted.
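The rank check itself is simple to sketch. The code below is not CVMSysMutex: DemoSysMutex, DemoLockState and the two functions are hypothetical, no real OS lock is taken, and the sketch assumes locks are released in reverse order of acquisition. It returns 0 where an assertion-enabled build would assert.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_HELD 8

typedef struct DemoSysMutex {
    unsigned rank;   /* fixed when the mutex is created */
} DemoSysMutex;

/* Per-thread record of the mutexes currently held, in acquisition order. */
typedef struct DemoLockState {
    const DemoSysMutex *held[DEMO_MAX_HELD];
    int count;
} DemoLockState;

/* Returns 1 if the lock is allowed, 0 on a rank violation. */
int demoSysMutexLock(DemoLockState *ls, const DemoSysMutex *m)
{
    if (ls->count == DEMO_MAX_HELD) return 0;
    /* The most recently taken mutex has the highest rank held so far. */
    if (ls->count > 0 && m->rank <= ls->held[ls->count - 1]->rank) return 0;
    /* A real implementation would acquire the underlying OS lock here. */
    ls->held[ls->count++] = m;
    return 1;
}

/* Returns 1 on success, 0 if m is not the most recently locked mutex. */
int demoSysMutexUnlock(DemoLockState *ls, const DemoSysMutex *m)
{
    if (ls->count == 0 || ls->held[ls->count - 1] != m) return 0;
    ls->count--;
    return 1;
}
```

The ordering rule prevents deadlock because a cycle of threads waiting on each other would require at least one of them to wait for a lower-ranked mutex while holding a higher-ranked one, which the check forbids.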

Java Execution Stack
Any thread of execution must have an execution stack. In CVM, each Java thread has 2 physical stacks: a native stack, and a Java stack. The native stack is the one that is allocated by the OS, and is used for C code execution. It holds the activation records (i.e. stack frames) of native code, and VM code including the interpreter loop function. It also holds activation frames for JIT compiled code (with a twist).

The Java stack (also known as the interpreter stack) is used to hold the activation records of Java methods. For each Java method that is executed, a frame will be pushed on this stack. Stack and frame data structures are defined in stacks.h here and stacks.c here.

If you dump a trace of the native stack when executing several Java methods, you will see stack frames for C code and the interpreter loop. If you dump a trace of the Java stack, you will only see stack frames for the Java methods that have been invoked. If you have a native method in the invocation chain, you will see a stack frame in both the native and Java stack. This is because the native method is both a C function and a Java method at the same time.

GC Roots and Root Stacks
In GC terms, CVM is called an exact VM. This means that at the time of GC, we will know definitively where all the object pointers are in the system. This is in contrast with conservative GC systems, which require you to guess whether some piece of memory contains an object pointer or just some random data that resembles an object pointer.

All reachable (and therefore live) objects in the VM can be found by tracing a tree (or trees) of object references called the GC root tree. Each tree starts from a root reference. These root references are essentially globals, and are usually stored in data structures called root stacks. An example of this is CVMglobals.globalRoots. Strictly speaking, these data structures need not be stacks. They are actually used as lists. However, our Java stack data structures have properties that fulfill the needs of GC root stacks nicely, and don't require us to write additional code (good for code efficiency). So, we just use the stacks.

If an object cannot be found by tracing the root trees, then that object is unreachable and therefore can be reclaimed by the GC.

Note that in traversing a tree, at any point in the traversal, a node can be the root of a new subtree. Hence, the term root or GC root is sometimes used to refer to object pointers / references that are found along the way in a root scan. GC roots can be found in the root stacks, in thread execution stacks, and in object and class fields.
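Tracing from roots can be sketched as a simple recursive mark pass. This is only an illustration of reachability, not CVM's generational GC: DemoObj, demoMark and demoMarkFromRoots are hypothetical, each object has a fixed number of reference fields, and the root list is assumed to contain only genuine object pointers (the "exact" property).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_FIELDS 2

/* Hypothetical object with a mark bit and a fixed set of reference fields. */
typedef struct DemoObj {
    int marked;
    struct DemoObj *fields[DEMO_MAX_FIELDS];
} DemoObj;

/* Mark obj and everything reachable from it; the mark bit stops cycles. */
void demoMark(DemoObj *obj)
{
    if (obj == NULL || obj->marked) return;
    obj->marked = 1;
    for (size_t i = 0; i < DEMO_MAX_FIELDS; i++) {
        demoMark(obj->fields[i]);   /* each field acts as a root of a subtree */
    }
}

/* Trace from every entry in a root list. */
void demoMarkFromRoots(DemoObj *const *roots, size_t numRoots)
{
    for (size_t i = 0; i < numRoots; i++) {
        demoMark(roots[i]);
    }
}
```

After the pass, any object whose mark bit is still clear was not reachable from a root, and is a candidate for reclamation.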

the End
That should be enough to give you an overall idea of how the major data structures are laid out in CVM. Note: most of the things I told you above are meant to give you a good conceptual model of the lay of the land. In practice, there will be exceptions in some cases for various reasons. Sometimes, these exceptions will break the rules. Other times, they are like extensions to the rules. To keep things simple, I left out the exceptions. I may get into those when I talk about each subsystem and/or data structure specifically.

In the above, I also left out many juicy details like ... why a data structure is allocated from the C heap vs the Java heap. I'll leave that for subsequent discussions.

So, in the next few days (or weeks), I will zoom in on the CVM subsystems and/or data structures (one at a time), and talk about them in detail. This will include mechanical details as well as design philosophies for why things are the way they are (when relevant, of course). Again, feel free to ask questions or make requests for topics. I will try to accommodate as much as I can.

Have a nice day. :-)
