Automatic Crash Reporting
It seems that more often than not, logging fails to garner the respect it deserves. Many applications weave megabytes of useful debug information from a combination of thousands of successful and failed requests into a single log file. Tools help filter the result to some degree but can't help much when the production application forgoes debug level logging for the sake of performance. How many times have you troubleshot a production issue sans adequate information?
My current web application made do during development with a simple default error page that printed the exception stack trace to the browser. Testers would copy the stack trace and enter a short description into Bugzilla. In the past, when the production deadline rolled around, we would typically modify the error handler to display a generic error message and unique ID in place of the stack trace and fire off an e-mail to an administrator. This time we decided to turn things up a notch. First, throwing away useful debug level log messages seemed like an enormous waste. If dumping the messages from separate requests to one place produces a bottleneck, why not keep them separate? Second, filing bugs and filtering out duplicates struck me as unnecessarily rote. How could we better automate the process?
Inspired by an entry in Marick's blog, I built a crash reporting framework (nicknamed "Bobzilla" by a coworker). The simple implementation does not impact application code; we combine a custom Servlet filter and a custom log4j appender to capture the log messages for the scope of a request in a thread local buffer. When the filter catches an exception, it creates a new bug in Bugzilla and uses the log messages leading up to the exception and the exception's stack trace as the bug description.
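To make the thread-local buffering idea concrete, here is a minimal sketch in plain Java. It is not the actual "Bobzilla" code: the class `RequestLogBuffer` and its methods are hypothetical names, `append` stands in for what a custom log4j appender would do on each log event, and `runRequest` stands in for the Servlet filter wrapping a request. It assumes each request is handled on a single thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-request log buffer; a stand-in for the appender/filter pair. */
final class RequestLogBuffer {
    // One message list per thread, assuming one thread handles one request at a time.
    private static final ThreadLocal<List<String>> BUFFER =
            ThreadLocal.withInitial(ArrayList::new);

    private RequestLogBuffer() {}

    /** What the (assumed) custom appender would call for every log event. */
    static void append(String message) {
        BUFFER.get().add(message);
    }

    /** Returns a copy of the messages captured so far on this thread. */
    static List<String> snapshot() {
        return new ArrayList<>(BUFFER.get());
    }

    /** Drops this thread's buffer so a pooled thread starts the next request clean. */
    static void clear() {
        BUFFER.remove();
    }

    /**
     * Mimics the filter: runs the request and, if it throws, builds a bug
     * description from the buffered messages plus the exception.
     * Returns null when the request succeeds.
     */
    static String runRequest(Runnable request) {
        try {
            request.run();
            return null; // no crash, nothing to report
        } catch (RuntimeException e) {
            StringBuilder report = new StringBuilder();
            for (String line : snapshot()) {
                report.append(line).append('\n');
            }
            report.append("Exception: ").append(e);
            return report.toString();
        } finally {
            clear(); // discard the messages whether or not the request failed
        }
    }
}
```

Because successful requests simply discard their buffer, debug level messages only cost memory for the lifetime of a request and never contend for a shared log file. A real filter would also forward to an error page or rethrow after filing the report; the sketch returns the text instead to stay self-contained.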
Integrating with Bugzilla proved easier than we expected. Tired of waiting for permission to directly access the Bugzilla database, I decided to post bugs directly to the Bugzilla web application. I discovered through experimentation that Bugzilla lets you pass the user ID and password along with the rest of the parameters, so authentication was a snap. Integration amounted to looking at the HTML source for the "New Bug" page to see what parameters it passed in and duplicating the effort using a
We filter out duplicate exceptions by hashing the stack trace. Unfortunately, even after some experimentation and tweaking, a couple of duplicates still make it through. Any further filtering would have to be application specific and would require more development effort than simply invalidating duplicate bugs by hand. We're still much better off than where we started.
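One way to hash a stack trace for duplicate detection is sketched below. The details are assumptions rather than a description of what "Bobzilla" does: the hypothetical `StackTraceFingerprint` class hashes the exception class plus each frame's class and method name, and deliberately ignores line numbers and the exception message so that small code shifts or differing IDs in the message do not produce new bugs. SHA-256 is an arbitrary choice here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that reduces a stack trace to a stable fingerprint. */
final class StackTraceFingerprint {
    private StackTraceFingerprint() {}

    /** Returns a hex SHA-256 digest of the exception type and its frames. */
    static String of(Throwable t) {
        // Build a key from the exception class and each frame's class#method.
        // Line numbers and the message are left out on purpose (an assumption).
        StringBuilder key = new StringBuilder(t.getClass().getName());
        for (StackTraceElement frame : t.getStackTrace()) {
            key.append('|')
               .append(frame.getClassName())
               .append('#')
               .append(frame.getMethodName());
        }
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(key.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Before filing a bug, the filter could look the fingerprint up in a set of already-reported fingerprints and skip the report on a match. Two different bugs that happen to share a call path would still collapse into one, and one bug reached through two paths would still be reported twice, which matches the kind of residual duplicates described above.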
With "Bobzilla" in place for a few weeks now, the turnaround time for addressing bugs has dropped considerably, testers spend more time identifying functionality issues, fewer problems slip through the cracks, we collect debug level messages in production with no performance penalty, and I no longer waste time tailing logs and filtering the noise caused by ten concurrent requests. As a next step, I think I might implement a dynaop interceptor that does for our service layer what the servlet filter does for our web application.