Debugging tales - 'Improbable' != 'Impossible'
This week my team re-learned the debuggers' mantra: "What you think is improbable probably isn't as improbable as you thought."
A horrible performance problem cropped up in an upcoming release of our web-based product last Thursday. Each page of the application would take almost thirty seconds to load. This is not a good thing.
Embarrassingly, this problem was hard to track down because we had been sloppy. In our haste to wrap things up we had thrown caution to the wind and short-circuited our normally conservative change management process. We'd had a major database change, experienced a problem with our LDAP server, and deployed a significant cleanup of our security code in the preceding days. While all these headaches were being addressed, we either didn't notice the problem when it was introduced, or more likely we chalked it up to one of our known issues. Added to this chaos, the problem only manifested itself on the systems that we use for functional and acceptance testing. On our development machines everything was copacetic. Everything passed both unit and regression tests.
So how did we track down this culprit? We started by looking at the app-servers, in our case WebLogic. Memory and CPU utilization looked fine; there were some "suspicious" processes hanging around, but nothing truly onerous. One developer then spent fruitless hours searching through logs to find a clue. Completely out of luck, we reverted to wishful thinking: "I'll bet the problem is..."
For half a day, we convinced ourselves that the previously mentioned LDAP problem was the culprit. Probably not a bad guess, but we did not pursue other avenues while we waited. Once the LDAP problem was resolved, we eliminated a possible database issue by direct investigation. This did not take long; the problem wasn't LDAP- or database-related.
Now the plot thickens... We identified an EAR that had been extensively refactored, and discussed among ourselves whether this could be the culprit. Our consensus was unanimous: the EAR could not be the cause of the behavior we were experiencing, so we decided to move on. It was late Friday afternoon, so we called it quits until Monday.
On Monday morning, we began the process of installing Quest Software's PerformaSure on the servers in question (something we had intended to do months ago, but never got around to). We've used PerformaSure before; we had just never installed it on the QA servers.
By itself, installing PerformaSure is not a time-consuming process, but since another group was involved in configuring the servers, the task was going to take several hours.
For some reason, while waiting on the installation I decided to deploy the "old" version of the EAR that we had dismissed on Friday. You guessed it: the EAR that couldn't possibly be the problem proved to be the culprit. Once identified, the code was repaired in a couple of hours.
There are several morals to this tale, but the one I want to remember is this:
Try the easy things rather than discussing them.
Sherlock Holmes' mantra was:
"When you have eliminated the impossible, whatever remains, however improbable, must be the truth."
The rub is in determining what is improbable rather than impossible. Knowing your system well can be a double-edged sword. On the one hand it quickly helps you identify probable causes, but on the other it can lead you to dismiss options too quickly. I think that's what happened to us. We talked ourselves through a trivial test rather than performing the test.
Maybe I'll remember this next time...
(Cross posted at The Thoughtful Programmer)