Skip to main content

Logging of causal distributed application behavior

Posted by mranga on December 21, 2003 at 6:49 PM PST

I've been hacking at building distributed applications for some time now (both in Java and C++) and, because I was invented before debuggers were invented, I dont use debuggers. Besides my code tends to be multi-threaded and messy and open source and bugs occur on other people's machines that I cannot access. Thus, my code carries its own debugging facility, which in my bag of tricks means liberal logging. I usually resort to :

  • Writing huge log files
  • . In java I do new Exception().printStackTrace liberally in order to capture runtime execution.
  • . Staring at multiple log filesgenerted at multiple nodes and correlating log files to source code until I become quite bad tempered and cross eyed.

So how do we improve the situation?

Theres things like log4j that nicely write things to log files although I am not fascinated by the mere ability to write to a file. However, I have often thought it would be very very cool to have the ability to do causal (vector) timestamps on messages so that when a socket writes out a message for example, if logging is enbabled, the outgoing message can be stamped with a causal clock. ( A vector clock is simply a clock with N fields for N nodes, each node increments its counter when it sends/receives a message. A vector clock allows you to determine relative ordering of events in a distributed application. Its actually a very old idea employed in distribtued systems such as ISIS.)

Then the log files from different nodes can be gathered and one can re-play execution in causal sequence. What with the nice introspection features of java, one might even be able to record and replay a distributed system. Now, protocols like SIP (with which I've been hacking of late), do capture causality in message sequences through the notions of transactions, sequence nunbers etc and we bult some nice log file viewing tools around it, but is there a more general way to do this using vector timestamps? So where is this leading ?

  • 1. Lets add a standard log format to capture causality in distributed applications.
  • 2. Lets build tools for correlation of distributed logs without needing to synchronize wall clock time.
  • and finally, I would love to not employ the infamous new Exception().printStackTrace to capture stack traces in log files (which invariably frightens unsuspecting users). It would be nice to have a better means of doing this (perhaps there is and I dont know about it).