A global Glassfish outage
One of the exciting things about teaching is that no matter how well you prepare for a class, events will always surprise you. Yesterday I was caught by surprise in the middle of a class by what seemed like a global Glassfish admin console outage.
I was teaching my Software Architecture students at IGTI how to change the default maximum thread pool size for the HTTP listener in Glassfish 3.0.1. We needed to do that in order to run a JMeter load test against our example application. The default maximum thread pool size in Glassfish is just 5 threads, which makes it impossible to stress the system with a decent number of concurrent users.
I tried to access the Glassfish admin console at http://localhost:4848 on my notebook. It started loading but hung after the "Admin Console is starting..." screen. I restarted Glassfish, but nothing changed.
So I thought: "well, my Glassfish instance must be broken somehow". I tried it on a student's notebook and got the same thing: the admin console still failed to load.
I switched to a remote virtual desktop (Ubuntu Lucid) running on Amazon EC2, started Glassfish, and tried to load the admin console. Same result: it hung after the starting screen.
Luckily we had a coffee break coming up, so I sent the students to eat something while I tried to figure out what was going on. What could possibly have broken the admin console in 4 Glassfish instances running under two operating systems (Windows and Linux) in two different countries (Brazil and USA)?
My sysadmin years taught me to look at the network whenever something stops working for no apparent reason. So I ran netstat and found some suspicious connections from the Glassfish process to *pkg.sun.com hosts. That reminded me that the admin console has an auto-update feature. Maybe that had something to do with the failure to load?
To test this theory I disconnected my notebook from the network, restarted Glassfish and tried http://localhost:4848 again. It loaded promptly and worked like a charm. So I explained to my students why Glassfish was behaving like that and told them to try the same solution on their notebooks. It worked for everyone, so we were able to go on with the class.
What I explained to the students was this: the admin console must be trying to check for updates while loading. If it can't connect to the update server (a fast kind of failure), it ignores the error and finishes loading. No problem there, only a few seconds lost. But when it can connect to the update server and the server is just not responding (a slow kind of failure), it will wait forever for a reply and hence never finish loading.
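To make the difference between the two kinds of failure concrete, here is a minimal Java sketch. It is not Glassfish's actual update code, which I have not read; the class `UpdateCheck`, the method `checkForUpdates` and the 500 ms timeout are all hypothetical. The local `ServerSocket` plays the role of an update server that accepts the connection but never answers. With a read timeout set, the slow failure becomes just another fast one.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.URI;

class UpdateCheck {
    /**
     * Hypothetical update check: returns true only if the server answers
     * with HTTP 200 within the given timeout, false on any failure.
     */
    static boolean checkForUpdates(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Bounds the time spent establishing the connection.
            conn.setConnectTimeout(timeoutMillis);
            // Bounds the time spent waiting for a reply once connected.
            // Without this, a silent server can block the caller indefinitely.
            conn.setReadTimeout(timeoutMillis);
            try {
                return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            // Covers both "connection refused" (fast failure) and
            // SocketTimeoutException (slow failure turned into a fast one).
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an unresponsive update server: the OS completes the
        // TCP handshake, but nobody ever reads the request or sends a reply.
        try (ServerSocket silentServer = new ServerSocket(0)) {
            String url = "http://localhost:" + silentServer.getLocalPort() + "/";
            long start = System.nanoTime();
            boolean ok = checkForUpdates(url, 500);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("update check ok=" + ok + " after about " + elapsedMillis + " ms");
        }
    }
}
```

If you comment out the `setReadTimeout` call, the same program should simply sit there, which is roughly what the admin console appeared to be doing.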
This kind of unintended coupling is not uncommon in networked applications. But when it happens in something as big as Glassfish, it gets kind of scary. I wonder how many other users were scratching their heads just then, wondering what was going on with their servers.
But this event, unpredictable as it was, presented me with the opportunity to teach one more Software Architecture lesson to my students: if your application uses a Cloud service, it had better be prepared for the service to fail in all sorts of unexpected ways.
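One way to be prepared, sketched here under my own assumptions, is to never let an optional remote call sit on the critical path without a deadline and a fallback. The class `GuardedCall` and the method `callWithFallback` are hypothetical names for illustration, and the 60-second sleep stands in for a remote service that has stopped responding.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class GuardedCall {
    /**
     * Runs remoteCall on a separate daemon thread and waits at most
     * timeoutMillis for its result. On timeout or error, returns fallback.
     */
    static <T> T callWithFallback(Callable<T> remoteCall, T fallback, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "remote-call");
            thread.setDaemon(true); // a stuck call must never keep the JVM alive
            return thread;
        });
        Future<T> future = executor.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return fallback; // service too slow or failed: carry on without it
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return fallback;
        } finally {
            future.cancel(true); // interrupt the worker if it is still running
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String status = callWithFallback(() -> {
            Thread.sleep(60_000); // simulated unresponsive cloud service
            return "updates available";
        }, "update status unknown", 200);
        System.out.println(status); // prints "update status unknown"
    }
}
```

The design choice is that the caller decides how long the service is worth waiting for, instead of leaving that decision to the network.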