Eamonn McManus

Eamonn McManus's Blog

Disassembling serialized Java objects

Posted by emcmanus on June 12, 2007 at 06:51 AM | Comments (5)

Presenting Serialysis, a library that allows you to disassemble the serial form of Java objects. This can allow you to retrieve information about an object that is not available through its public API. It is also a useful tool when testing the serialization of your classes.

When the public API is not enough

My reason for writing this library is that I encountered a couple of problems where I found that I needed information from an object that was not available through its public API, but that was available through its serial form.

One example is if you have a stub for a remote RMI object, and you want to know what address it will connect to, or what port, or using what socket factory. The standard RMI API doesn't give you any way to extract this information from the stub. But the information is there, and it must be included when the stub is serialized so that the stub is usable when it is later deserialized. So if we could somehow parse the serialized stub we could get the information we want.

A second example comes from the JMX API. Queries to the MBean Server are represented by the interface QueryExp. QueryExp instances are constructed using the methods of the Query class. If you have an object implementing QueryExp, how can you know what query it executes? The JMX API doesn't include any method to find out. The information must be present in the serial form, so that when a client sends a query to a remote server it can be reconstituted on the server. If we could look at the serial form, we could find out what the query was.

This second example is what prompted me to write this library. The existing standard JMX connectors are based on Java serialization, so they don't need to do anything special for QueryExps. But the new Web Services Connector being defined by JSR 262 uses XML for serialization. How can it analyze a QueryExp in order to convert it into XML? The answer is that the WS Connector uses a version of this library to look at the Java-serialized QueryExp.

What these examples have in common is that they illustrate gaps in the relevant APIs. There ought to be methods that allow you to extract the information contained in an RMI stub. There ought to be methods that convert back from a QueryExp object to the original Query methods that constructed it. (Even a standardized parseable toString() would be enough.) But those methods aren't there today, and if we want code that works with those APIs as they are now, we need another approach.

Grabbing the private fields of objects

If you have the source code of the classes you're interested in, it's tempting just to barrel in and grab the information you need. In the RMI stub example, we can find out by experiment that the stub's getRef() method returns a sun.rmi.server.UnicastRef, and by studying the JDK source we might be able to figure out that this class contains a field ref of type sun.rmi.transport.LiveRef with the information we need. So we might end up with code like this:

// This is NOT a good idea!!!

import sun.rmi.server.*;
import sun.rmi.transport.*;
import java.lang.reflect.Field;
import java.rmi.*;
import java.rmi.server.*;

public class StubDigger {
    public static int getPort(RemoteStub stub) throws Exception {
        RemoteRef ref = stub.getRef();
        UnicastRef uref = (UnicastRef) ref;
        // Reach into a private field of an internal class via reflection.
        Field refField = UnicastRef.class.getDeclaredField("ref");
        refField.setAccessible(true);
        LiveRef lref = (LiveRef) refField.get(uref);
        return lref.getPort();
    }
}

You might be satisfied with this, but you shouldn't be. The lines that touch the sun.* classes and java.lang.reflect.Field are full of horrors. First of all, you should never depend on sun.* classes, because there's no guarantee they won't change unrecognizably in any JDK update, plus of course your code probably won't be portable to platforms other than the JDK. Secondly, it's a huge red flag when you see Field.setAccessible being called. That means the code is depending on undocumented fields, which again could change between releases, or, worse, which might continue to exist but with subtly different semantics.

(The above code was written for JDK 5. It turns out that in JDK 6, LiveRef acquires a public getPort() method, so you no longer need Field.setAccessible. But you still need to depend on sun.* classes.)

Well, sometimes you can't do any better than this. But if the class you're interested in is serializable, often you can. The reason is that the serial form of a class is part of its public interface. If the API is any good at all then its public interfaces will evolve compatibly in every update. This is a very strong requirement on the JDK platform in particular.

So if the information you need isn't available through a class's public methods, but is part of the documented serial form, then you can rely on it remaining in the serial form in the future.

The serial form is included in the Javadoc output as part of the See Also for each serializable class. You can see the serial form of all public JDK classes in a single giant page.

Enter Serialysis

My library to parse serialized objects is called Serialysis, the result of cramming the words "serial analysis" too close together.

Here's a simple example of what it looks like in action. This code...

    	SEntity sint = SerialScan.examine(new Integer(5));
	System.out.println(sint);

...produces this output...

SObject(java.lang.Integer){
  value = SPrim(int){5}
}

This tells us that the java.lang.Integer that we gave to SerialScan.examine serializes as an object with a single field value of type int. If we check out the documented serialized form of java.lang.Integer we can see that this is indeed what is expected.
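The same check can be made from code with nothing but the JDK. The following is a minimal sketch, not part of Serialysis: it asks java.io.ObjectStreamClass for the serializable fields of a class. The class name SerialFieldsDemo and the method describe are made up for this illustration.

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;

// Hypothetical helper for illustration only; not part of Serialysis.
class SerialFieldsDemo {
    /** Lists the serializable fields of a class as "name:type" pairs. */
    static String describe(Class<?> cls) {
        // lookup() returns null for classes that are not serializable.
        ObjectStreamClass desc = ObjectStreamClass.lookup(cls);
        if (desc == null) {
            throw new IllegalArgumentException(cls + " is not serializable");
        }
        StringBuilder sb = new StringBuilder();
        for (ObjectStreamField f : desc.getFields()) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(f.getName()).append(':').append(f.getType().getName());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // For java.lang.Integer this prints: value:int
        System.out.println(describe(Integer.class));
    }
}
```

Note that this only reports the declared serial fields. It says nothing about extra data written by a writeObject method, which is where a stream parser becomes necessary.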

If you check out the source code of java.lang.Integer, you'll see that the class itself also has a single field value of type int:

    /**
     * The value of the <code>Integer</code>.
     *
     * @serial
     */
    private final int value;

But private fields are an implementation detail. An update could rename this field, or replace it with a new field inherited from the parent class java.lang.Number, or whatever. There's no guarantee that that won't happen, but there is a guarantee that the serial form will remain the same. Serialization provides mechanisms to keep the serial form the same even when the class's fields change.
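One such mechanism is the serialPersistentFields declaration, combined with ObjectOutputStream.putFields() and ObjectInputStream.readFields(). The sketch below assumes a made-up class, Temperature, whose private field has been renamed to celsius while its serial form still contains a field called value. All names here are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamField;
import java.io.Serializable;

// Hypothetical example class; the names are assumptions for illustration.
class RenamedFieldDemo {
    static class Temperature implements Serializable {
        private static final long serialVersionUID = 1L;

        // The serial form is declared explicitly, independently of the real fields.
        private static final ObjectStreamField[] serialPersistentFields = {
            new ObjectStreamField("value", int.class)
        };

        // Renamed implementation field; it does not appear in the serial form.
        private int celsius;

        Temperature(int celsius) {
            this.celsius = celsius;
        }

        int celsius() {
            return celsius;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            ObjectOutputStream.PutField fields = out.putFields();
            fields.put("value", celsius);   // map the new field onto the old serial name
            out.writeFields();
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            ObjectInputStream.GetField fields = in.readFields();
            celsius = fields.get("value", 0);
        }
    }

    /** Serializes and deserializes in memory. */
    static Temperature roundTrip(Temperature t) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(t);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Temperature) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new Temperature(21)).celsius());
    }
}
```

A reader of the serial stream sees only the field value of type int, whatever the class calls its private field internally.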

Here's a more complicated example. Suppose that, for some reason, we want to know how big the array in an ArrayList is. The API doesn't allow us to find out, though it does allow us to force the array to be at least a certain size.

If we check the serial form of ArrayList, we see that it does contain the information we're looking for. There's a serialized field size, which is the number of elements in the list. That's not what we want. But the Serial Data in the writeObject method does have what we want:

Serial Data:
The length of the array backing the ArrayList instance is emitted (int), followed by all of its elements (each an Object) in the proper order.

If we execute this code...

	List<Integer> list = new ArrayList<Integer>();
	list.add(5);
	SObject slist = (SObject) SerialScan.examine(list);
	System.out.println(slist);

...we get this output...

SObject(java.util.ArrayList){
  size = SPrim(int){1}
  -- data written by class's writeObject:
  SBlockData(blockdata){4 bytes of binary data}
  SObject(java.lang.Integer){
    value = SPrim(int){5}
  }
}

This is where we get into the gory details of serialization. In addition to, or instead of, serializing an object's fields, its class can declare a method writeObject(ObjectOutputStream) that writes arbitrary data to the serial stream using methods like ObjectOutputStream.writeInt. It must declare a corresponding readObject that reads the same data, and it should document via a @serialData tag what the writeObject method writes, as ArrayList does.
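To make the writeObject/readObject pairing concrete, here is a small sketch of a class loosely modelled on the ArrayList case. The class IntBag and everything in it are made up for this illustration; it is not the actual ArrayList code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical example class; not the real ArrayList implementation.
class WriteObjectDemo {
    static class IntBag implements Serializable {
        private static final long serialVersionUID = 1L;

        /** @serial the number of elements in use */
        private int size;

        // Not serialized directly; writeObject decides what goes on the stream.
        private transient int[] items = new int[10];

        void add(int x) {
            if (size == items.length) {
                items = Arrays.copyOf(items, size * 2);
            }
            items[size++] = x;
        }

        int get(int i) { return items[i]; }
        int size() { return size; }
        int capacity() { return items.length; }

        /** @serialData the capacity (int), followed by each element (int) in order */
        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();        // writes the non-transient field: size
            out.writeInt(items.length);
            for (int i = 0; i < size; i++) {
                out.writeInt(items[i]);
            }
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();          // restores size
            int capacity = in.readInt();
            if (size < 0 || capacity < size) {
                throw new InvalidObjectException("capacity " + capacity + " < size " + size);
            }
            items = new int[capacity];       // field initializers do not run on deserialization
            for (int i = 0; i < size; i++) {
                items[i] = in.readInt();
            }
        }
    }

    /** Serializes and deserializes in memory. */
    static IntBag roundTrip(IntBag bag) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(bag);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (IntBag) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        IntBag bag = new IntBag();
        bag.add(5);
        IntBag copy = roundTrip(bag);
        System.out.println(copy.size() + " element(s), capacity " + copy.capacity());
    }
}
```

The reader has to consume exactly what the writer produced, in the same order, which is why the @serialData comment matters.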

The writeObject data is accessible in Serialysis through the method SObject.getAnnotations(), which returns a List<SEntity>. Each Object that was written via the method ObjectOutputStream.writeObject(Object) appears as an SObject in this list. Each chunk of data written by one or more consecutive calls to the methods that ObjectOutputStream gets from DataOutput (writeInt, writeUTF, etc) appears as an SBlockData. The serial stream doesn't include enough information to separate out individual items within the chunk; that information is an agreement between writer and reader that is documented by the @serialData tag.
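The way consecutive primitive writes merge into one chunk can be seen in the raw bytes. The sketch below uses only the JDK and a made-up class, Tagged, whose writeObject calls writeInt and then writeUTF. It prints the tail of the stream in hex, where the block-data marker (0x77, TC_BLOCKDATA in java.io.ObjectStreamConstants) is followed by a one-byte length and then the undifferentiated payload.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical example class; the names are assumptions for illustration.
class BlockDataDemo {
    static class Tagged implements Serializable {
        private static final long serialVersionUID = 1L;

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();   // the class has no fields, so nothing is written here
            out.writeInt(42);           // 4 bytes
            out.writeUTF("hi");         // 2-byte length followed by 2 bytes of text
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            in.readInt();
            in.readUTF();
        }
    }

    /** Serializes obj and returns the last n bytes of the stream as hex. */
    static String hexTail(Serializable obj, int n) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bytes.toByteArray();
        StringBuilder sb = new StringBuilder();
        for (int i = all.length - n; i < all.length; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", all[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: 77 08 00 00 00 2a 00 02 68 69 78
        // 77 = block data, 08 = length, then the int 42 and the UTF string "hi",
        // and finally 78 = end of the writeObject data.
        System.out.println(hexTail(new Tagged(), 11));
    }
}
```

Nothing in those eight payload bytes says where the int ends and the string begins. Only the class's documentation can tell a reader that.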

Based on the ArrayList documentation, we can find the size of the array like this:

	SObject slist = (SObject) SerialScan.examine(list);
	List<SEntity> writeObjectData = slist.getAnnotations();
	SBlockData data = (SBlockData) writeObjectData.get(0);
	DataInputStream din = data.getDataInputStream();
	int alen = din.readInt();
	System.out.println("Array length: " + alen);

How Serialysis solves my example problems

Without showing all the details of the code, here's the outline of the solution to the QueryExp problem I mentioned. Suppose I have a QueryExp constructed like this:

QueryExp query =
    Query.or(Query.gt(Query.attr("Version"), Query.value(5)),
	     Query.eq(Query.attr("SupportsSpume"), Query.value(true)));

This means, "MBeans where the Version attribute is greater than 5 or the SupportsSpume attribute is true." The toString() of this query in the JDK looks like this:

((Version) > (5)) or ((SupportsSpume) = (true))

The result of SerialScan.examine looks like this:

SObject(javax.management.OrQueryExp){
  exp1 = SObject(javax.management.BinaryRelQueryExp){
    relOp = SPrim(int){0}
    exp1 = SObject(javax.management.AttributeValueExp){
      attr = SString(String){"Version"}
    }
    exp2 = SObject(javax.management.NumericValueExp){
      val = SObject(java.lang.Long){
        value = SPrim(long){5}
      }
    }
  }
  exp2 = SObject(javax.management.BinaryRelQueryExp){
    relOp = SPrim(int){4}
    exp1 = SObject(javax.management.AttributeValueExp){
      attr = SString(String){"SupportsSpume"}
    }
    exp2 = SObject(javax.management.BooleanValueExp){
      val = SPrim(boolean){true}
    }
  }
}

You can imagine code that descends into this structure producing an XML equivalent. Every conformant implementation of the JMX API is required to produce this same serial form, so the code that parses it is guaranteed to work everywhere.

Now here's the code that solves the RMI stub port number problem:

    public static int getPort(RemoteStub stub) throws IOException {
	SObject sstub = (SObject) SerialScan.examine(stub);
	List<SEntity> writeObjectData = sstub.getAnnotations();
	SBlockData sdata = (SBlockData) writeObjectData.get(0);
	DataInputStream din = sdata.getDataInputStream();
	String type = din.readUTF();
	if (type.equals("UnicastRef"))
	    return getPortUnicastRef(din);
	else if (type.equals("UnicastRef2"))
	    return getPortUnicastRef2(din);
	else
	    throw new IOException("Can't handle ref type " + type);
    }

    private static int getPortUnicastRef(DataInputStream din) throws IOException {
	String host = din.readUTF();
	return din.readInt();
    }

    private static int getPortUnicastRef2(DataInputStream din) throws IOException {
	byte hasCSF = din.readByte();
	String host = din.readUTF();
	return din.readInt();
    }

To understand this, you need to see the serial form for RemoteObject. This code is admittedly difficult, but it is portable and futureproof. It should be fairly clear how to extract the other information I mentioned from RMI stubs using the same approach.
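For readers without Serialysis at hand, the layout that getPortUnicastRef relies on can be tried out in isolation. The sketch below builds a byte array with the same leading items as a "UnicastRef" block (type name, host, port) using DataOutputStream, then reads them back. The class and method names, the host, and the port are made up for illustration. A real stub's block continues with the object identifier, which this sketch ignores.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo of the first items in a "UnicastRef" block; not Serialysis code.
class UnicastRefLayoutDemo {
    /** Builds sample bytes laid out as: UTF type name, UTF host, int port. */
    static byte[] sampleBlock(String host, int port) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF("UnicastRef");
            out.writeUTF(host);
            out.writeInt(port);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the type name, host and port back in the order they were written. */
    static String describe(byte[] block) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(block))) {
            String type = in.readUTF();
            if (!type.equals("UnicastRef")) {
                throw new IllegalArgumentException("Can't handle ref type " + type);
            }
            String host = in.readUTF();
            int port = in.readInt();
            return type + " " + host + ":" + port;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints: UnicastRef 192.0.2.1:1099
        System.out.println(describe(sampleBlock("192.0.2.1", 1099)));
    }
}
```

Reading the host before the port is not optional: the items carry no tags, so skipping one would misinterpret every byte that follows.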

Conclusions

You really don't want to get into disassembling serial forms unless you have to. But if you do have to, then Serialysis should make your task a little less painful.

It's also a good way to check that your own classes serialize the way you expect them to.

Download

You can download the Serialysis library at http://weblogs.java.net/blog/emcmanus/serialysis.zip.

Comments

  • "My reason for writing this library is that I encountered a couple of problems where I found that I needed information from an object that was not available through its public API, but that was available through its serial form."

    I think the last part of this sentence is not an actual requirement. It is a possible solution but definitely not the most appropriate in my honest opinion. Why did you discount reflection in favor of serialization, especially in light of the obvious difference between the object's runtime representation and its serialized format? Surely the runtime field state would resemble much more closely the object's private state that you needed. Have I missed something?

    We have managed to display complex JMX attribute values within our performance monitoring and problem management console without the need to resort to intercepting ObjectOutput/IO calls. This feature was made available last June.

    Blog Entry: JMX JVMInsight with Object Field State
    Screencast: Insight Extensions

    William Louth
    JXInsight Product Architect
    CTO, JINSPIRED

    Posted by: wlouth on June 13, 2007 at 11:53 AM

  • William,

    Well, it's horses for courses. Finding the private fields of an object via reflection is appropriate if you are going to show them to a human. This is what you're doing with JXInsight if I understand your blog correctly. You don't care much if an update to the class renames the fields or changes their semantics, because the person looking at your console will still be able to figure out what's going on.

    On the other hand, if the information is being accessed by a program then you do care if it's reorganized. An update to a class has every right to change the private fields beyond recognition, which will instantly break any program that was based on the fields in the older version. But, as I was saying, the serial form is part of the public interface of the class, so it cannot be changed in an update, assuming compatibility is respected. So a program that relies on the serial form is more robust in the face of updates than one that relies on the private fields.

    Posted by: emcmanus on June 14, 2007 at 09:07 AM

  • Dear Eamonn McManus,

    I found this serialysis tool very useful, although what I need is a bit more. I would like not only to parse and analyse serialized objects, but also generate them.

    We have a TestManager tool that is able to record the client-server communication, and then, based on the gathered information, generate requests for the server and accept responses, on behalf of the client. It works well both for HTTP and socket communication, but it is the first time we have to work with content in the form of serialized objects. We have to modify some object fields before passing the request to the server, that is why we need some kind of generator.

    Do you know of any such tool?

    -Gabor

    Posted by: gumann on September 02, 2007 at 03:48 AM

  • Gabor,

    Unfortunately I am not aware of any tool that does what you want. You could try modifying Serialysis so that you can modify the various SObject etc classes and reserialize them, but that is a pretty big task. Some small modifications are possible by searching through the binary serialized data for the exact value you want to modify, and replacing those bytes with the new value. This has to be done with care and will be fragile, but it might be acceptable for a test framework. (If it breaks then presumably tests will fail spuriously, which you can fix.) For example you can rather safely replace a string with another string of the same length using this approach.

    Another possibility would be to deserialize the objects within your TestManager, modify them, then reserialize them before forwarding to the original recipient. The modification could access private fields in the way I described at the very start of this entry; again in a test framework this might be acceptable.

    Éamonn

    Posted by: emcmanus on September 03, 2007 at 06:16 AM

  • Dear Eamonn,

    I have listened to the communication between jconsole and the naming port of a JMX configuration in Tomcat.
    The IP address and the port of the RMI server are easy to see.
    But what does the data after the port number mean?
    I got the annotation value and finally the port number, but what comes after it, and how should I interpret it?
    I searched the web, but I didn't manage to find a good explanation of what the remote object's reference looks like in its binary representation.

    It would be very kind, if you could give me a hint.

    Kind regards,

    Bernhard

    PS.: the wireshark scan (jmx starts with offset 16):
    00000000 4e 00 0e 31 30 2e 31 30 30 2e 31 30 30 2e 31 30
    00000010 31 00 00 04 a5
    00000015 51 ac ed 00 05 77 0f 01 a4 f0 7f 08 00 00 01 14
    00000025 e6 dd 7c 6f 85 5f 73 72 00 2e 6a 61 76 61 78 2e
    00000035 6d 61 6e 61 67 65 6d 65 6e 74 2e 72 65 6d 6f 74
    00000045 65 2e 72 6d 69 2e 52 4d 49 53 65 72 76 65 72 49
    00000055 6d 70 6c 5f 53 74 75 62 00 00 00 00 00 00 00 02
    00000065 02 00 00 70 78 72 00 1a 6a 61 76 61 2e 72 6d 69
    00000075 2e 73 65 72 76 65 72 2e 52 65 6d 6f 74 65 53 74
    00000085 75 62 e9 fe dc c9 8b e1 65 1a 02 00 00 70 78 72
    00000095 00 1c 6a 61 76 61 2e 72 6d 69 2e 73 65 72 76 65
    000000A5 72 2e 52 65 6d 6f 74 65 4f 62 6a 65 63 74 d3 61
    000000B5 b4 91 0c 61 33 1e 03 00 00 70 78 70 77 37 00 0a
    000000C5 55 6e 69 63 61 73 74 52 65 66 00 0e 31 30 2e 31
    000000D5 30 30 2e 31 30 30 2e 31 30 32 00 00 04 21 36 f2
    000000E5 b4 5c 3c 5f 6e fa a4 f0 7f 08 00 00 01 14 e6 dd
    000000F5 7c 6f 80 00 01 78
    000000FB 53

    Posted by: bernyspeedy on September 12, 2007 at 12:15 AM


