
In-situ simulation reduces the pain of debugging distributed systems.

Posted by mranga on January 7, 2004 at 6:12 AM PST

I've spent an inordinate amount of time debugging distributed protocol stacks and applications. Building distributed systems and protocol stacks is a tricky affair. It takes a lot of time, patience and testing to get it all right, and then some. Reproducing bugs in such systems is tough. Building scalable test frameworks is tough. One normally resorts to looking at event logs, traces and the like. I have lamented in another blog about things of that nature, so I shall refrain from lamenting again. Now there's another technique which I have found quite handy, and it is the subject of this blog.

Discrete event simulation is used as a paper generator by the academic community, but I have used it for debugging. Using Java, one can build an in-situ simulation environment that allows you to replace Java APIs with simulated versions of the same. You then run your entire distributed system in the simulated environment and reproduce error conditions as well as timing and synchronization bugs. When you are ready to ship, just replace the simulated calls with the real thing and you are ready to go. How is this done?

Java has some very cute features, as we all no doubt agree, and one of these is dynamic loading. This means that one can take an existing distributed application that is written to the socket interface, for example, and that uses Java threads and processes, and replace all calls to the socket interface with calls to a simulated socket interface. One can likewise replace all calls that create processes and threads with calls that create simulated processes and threads, and so on. In effect, one can build an in-situ simulation environment with ease. If source code is not available, one can even use bytecode rewriting to accomplish this (assuming that the code has not been obfuscated). Using such a simulated environment one can:

  • Vary latency, packet loss and other network characteristics and test for protocol bugs.
  • Vary simulated delay in critical code paths and test for deadlocks and other delights.
  • Test for scalability of the implementation and even for memory leaks and such.
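To make the list above a little more concrete, here is a minimal sketch of a discrete event simulator driving a simulated link with configurable latency and packet loss. It is not the code from java-socket-simulator or jain-sip; every name in it (`SimulatedNetworkDemo`, `Simulator`, `SimulatedLink`, `Receiver`) is made up for illustration, and it assumes a one-way link with fixed latency and independent random loss. The clock is virtual, so a long run of simulated time finishes almost instantly, and the loss generator is seeded, so a failing run can be repeated exactly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Hypothetical sketch: a tiny discrete event simulator driving a slow, lossy link. */
class SimulatedNetworkDemo {

    /** One scheduled action, ordered by virtual time and then by insertion order. */
    static final class Event implements Comparable<Event> {
        final long time;
        final long seq;
        final Runnable action;

        Event(long time, long seq, Runnable action) {
            this.time = time;
            this.seq = seq;
            this.action = action;
        }

        @Override
        public int compareTo(Event other) {
            int byTime = Long.compare(time, other.time);
            return byTime != 0 ? byTime : Long.compare(seq, other.seq);
        }
    }

    /** Virtual clock plus event queue; no real time passes while it runs. */
    static final class Simulator {
        private final PriorityQueue<Event> queue = new PriorityQueue<>();
        private long now = 0;
        private long nextSeq = 0;

        long now() {
            return now;
        }

        /** Schedules an action to run delayMillis of virtual time from now. */
        void schedule(long delayMillis, Runnable action) {
            queue.add(new Event(now + delayMillis, nextSeq++, action));
        }

        /** Runs events in timestamp order until none remain. */
        void run() {
            while (!queue.isEmpty()) {
                Event e = queue.poll();
                now = e.time;
                e.action.run();
            }
        }
    }

    /** Receiver callback, standing in for whatever reads from the socket. */
    interface Receiver {
        void onMessage(String message, long arrivalTime);
    }

    /** A one-way link with fixed latency and seeded random loss (assumptions of this sketch). */
    static final class SimulatedLink {
        private final Simulator sim;
        private final long latencyMillis;
        private final double lossProbability;
        private final Random random;
        private final Receiver receiver;

        SimulatedLink(Simulator sim, long latencyMillis, double lossProbability,
                      long seed, Receiver receiver) {
            this.sim = sim;
            this.latencyMillis = latencyMillis;
            this.lossProbability = lossProbability;
            this.random = new Random(seed);
            this.receiver = receiver;
        }

        /** Either drops the message or schedules its delivery after the link latency. */
        void send(String message) {
            if (random.nextDouble() < lossProbability) {
                return; // simulated packet loss
            }
            sim.schedule(latencyMillis, () -> receiver.onMessage(message, sim.now()));
        }
    }

    public static void main(String[] args) {
        Simulator sim = new Simulator();
        List<String> delivered = new ArrayList<>();
        SimulatedLink link = new SimulatedLink(sim, 250, 0.3, 42L,
                (msg, time) -> delivered.add(msg + "@" + time));
        // Ten sends, 100 ms of virtual time apart.
        for (int i = 0; i < 10; i++) {
            final int n = i;
            sim.schedule(i * 100L, () -> link.send("msg-" + n));
        }
        sim.run();
        System.out.println("delivered " + delivered.size() + " of 10: " + delivered);
    }
}
```

Turning the latency and loss knobs, or changing the seed, is then a one-line change, which is the kind of variation the bullets above describe. A fuller environment would put a socket-shaped API on top of `SimulatedLink` so that application code does not need to know which one it is talking to.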

In other words, simulation is a great way of reproducing bugs and chasing down errors which are otherwise really hard to deal with. It is also a nice way of checking the scalability of your implementation. There are several simulation frameworks for Java out there, but few that attempt to replace the Java API with a simulated Java API.
Neko SIM is a cool project that tries to do this. I've launched a less ambitious project called java-socket-simulator (see JAVA Socket Simulator) that tries to just replace the socket layer and simulate processes and threads. I've used this successfully in the jain-sip project. Now, with the mere flick of a build flag and the adroit application of a pre-processor, you can build jain-sip as a discrete event simulator or as the real thing. I've used it to uncover some non-obvious synchronization bugs. (I found it handy to use a pre-processor; in this connection, see my rants on another weblog.) If I feel brave enough I might just do something foolish like simulate the whole JXTA system using this approach. (We have a machine with 2 gigs of main memory in the closet.)
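The switch described here is made at build time with a pre-processor. As a rough illustration of the same "simulated or real" choice, the sketch below makes it at run time instead, leaning on the dynamic loading mentioned earlier. This is a different technique from the pre-processor approach, and nothing in it comes from jain-sip or java-socket-simulator: the `Transport` interface, both implementations and the `transport.impl` property name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: pick the real or simulated transport by class name at run time. */
class TransportSwitchDemo {

    /** The narrow interface the application codes against (an assumption of this sketch). */
    interface Transport {
        void send(String host, int port, String payload);
    }

    /** Stand-in for the real thing; a real version would write to a java.net.Socket. */
    static class RealTransport implements Transport {
        @Override
        public void send(String host, int port, String payload) {
            throw new UnsupportedOperationException("real network I/O omitted from this sketch");
        }
    }

    /** Simulated version: records what would have gone on the wire. */
    static class SimulatedTransport implements Transport {
        final List<String> wire = new ArrayList<>();

        @Override
        public void send(String host, int port, String payload) {
            wire.add(host + ":" + port + " <- " + payload);
        }
    }

    /** Loads an implementation by name, using Java's dynamic class loading. */
    static Transport load(String className) {
        try {
            Class<? extends Transport> cls =
                    Class.forName(className).asSubclass(Transport.class);
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load transport " + className, e);
        }
    }

    public static void main(String[] args) {
        // "transport.impl" is a made-up property name; the default is the simulated version.
        String impl = System.getProperty("transport.impl", SimulatedTransport.class.getName());
        Transport transport = load(impl);
        transport.send("example.invalid", 5060, "REGISTER");
        System.out.println("using " + transport.getClass().getName());
    }
}
```

The trade-off is that a run-time switch keeps both versions in the shipped jar, whereas a build flag leaves the simulated code out of the real build entirely.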

This is not a new idea (see, for example, the x-kernel work), but Java, the availability of bytecodes and the virtual machine approach make it easier to accomplish.
