Skip to main content

Testus Interruptus or how to avoid InterruptedExceptions in JCK tests

Posted by simonis on November 13, 2007 at 12:12 PM PST

Although JCK tests are not a regression test suite it is probably not uncommon that they (or at least a significant subset of them) are used as automated tests. To do this successfully, a number of JCK tests like for example interactive tests or tests which require a special setup have to be excluded from the test suite. This can be easily achieved with the help of exclude lists.

However, even with such a setup, one still encounters infrequent, spurious test failures in random tests. Usually, these failures are not reproducible but still need a lot of attention in order to ensure that they do not signal a real problem with the tested VM.

After having analyzed such failed tests over a longer period of time I realized that most of them have been caused by an InterruptedException that was not supposed to be catched within that test. Notice that while some tests explicitely report that the reason of a failure is a catched InterruptedException, other tests fail silently. In such cases, only a closer look at the test sources revealed the fact, that most of them use to call one of the Object.wait(), Thread.join() or Thread.sleep() methods which can all be interrupted and throw an InterruptedException.

The only strange thing was that I couldn't identify any single call to Thread.interrupt() during the execution of these failed tests. This observation led to the assumption, that the interrupts must originate from the test agent or from the harness. Because the failures where so seldom and occurred in random tests, I assumed that they could be provoked by timeouts caused by a high machine load. But decreasing the JCK timeout factor just led to more tests that where aborted. Such test are flagged with an "Error" status in contrast to failed test which return in time but don't return the expected result (they are flagged as "Failed"). So that obviously wasn't the solution.

Finally, I decided to try it the hard way and instrumented Thread.interrupt() and the two constructors of the InterruptedException class to print a timestamp and a stack trace every time they were called. After running the JCK tests with the modified VM I could identify some interesting calls to Thread.interrupt():

java.lang.Throwable: Thread.interrupt() called for Thread[Agent0,3,main]
        at java.lang.Thread.interrupt(Thread.java:821)
        at com.sun.javatest.agent.SocketConnection$1.timeout(SocketConnection.java:120)
        at com.sun.javatest.util.Timer$1.run(Timer.java:84)

They only occurred in JCK runs with failed tests, but interestingly enough, they sometimes happened long (up to 30 minutes) before the failing test. After understanding the semantics of Thread.interrupt(), which only sets the status of a thread to "interrupted" but doesn't actively interrupt the thread, it was clear why - the corresponding thread gets interrupted for a yet still unknown reason, but the interrupted status of the thread is not cleared. Afterwards, the first test which calls one of the interruptible Object.wait(), Thread.join() or Thread.sleep() methods, will instantly raise an InterruptedException and fail badly.

So I just had to find out, why somebody would want to interrupt the test thread, why this interruption happens so randomly and probably most interesting: why isn't this interrupt handled by the test that provokes the interrupt. After downloading the JTHarness sources from https://jtharness.dev.java.net/ and digging into the code, I came up with the following explanation:

Below you can see a simplified view of how an agent handles request from the harness (for the full story see com/sun/javatest/agent/Agent.java). For every test, the agent establishes a new connection (an object of type SocketConnection) to the test harness. It receives the test name and parameters from the connection, executes the test and returns the result back through the connection.

Listing 1: Agent in pseudocode
Agent.handleRequestsUntilClosed() {
  while(!closing) {
    Connection connection = nextConnection();
    Task t = new Task(connection);
    t.handleRequest();
      // Task in pseudocode
      connection.in.readRequest();
      status = execute();
      connection.out.sendStatus(status);
      connection.out.flush();
      connection.waitUntilClosed(5000/*ms*/);
      close();
        // Task.close in pseudocode
        if (!connection.isClosed()) {
          connection.close();
}
  }
}

The interesting part here is the call to connection.waitUntilClosed(). The method waitUntilClosed() is declared in the Connection interface in com/sun/javatest/agent/Connection.java to potentially throw an InterruptedException. However, the implementation of the interface in com/sun/javatest/agent/SocketConnection.java doesn't throw any exceptions. It works as follows:

Listing 2: waitUntilClosed() in pseudocode
SocketConnection.waitUntilClosed(int timeout) {
  waitThread = Thread.currentThread();
  Timer.Timeable cb = new Timer.Timeable() {
    public void timeout() {
      waitThread.interrupt();
      socketInput.close();
      socketOutput.close();
    }
  };
  Timer.Entry e = timer.requestDelayedCallback(cb, timeout);
  try {
    while (true) {
      int i = socketInput.read();
      if (i == -1) break;    
    }
  }
  finally {
    timer.cancel(e);
  }
}

It first creates a Timeable object that will interrupt the thread and close the in- and output stream associated with the current socket after a given amount of timeout. Thereafter it reads from the socket input stream until it encounters an EOF condition. If the EOF is catched within the timeout period, the Timeable object is removed from the waiting timer thread and the method returns. Otherwise, if the stream was not closed by the harness within the timeout period, the timer thread will call the timeout() method of the Timeable object which in turn will call interrupt() on the waiting thread and explicitely close the socket streams. At this point, the thread waiting for the EOF condition on the socket stream, will finally get it and will return. Notice that in the latter case, the interrupt status of the thread is neither queried and transformed into an InterruptedException nor is it cleared.

At the end, handleRequest(), the caller of waitUntilClosed() will call the Task.close() method that will finally close the socket. Notice that handleRequest() doesn't handle the interrupt status of the thread either, so the thread will effectively stay marked as interrupted after a timeout has happened in waitUntilClosed(). This will ultimately lead to a failure in the next JCK test that calls one of the interruptible Object.wait(), Thread.join() or Thread.sleep() methods.

The solution of this problem is easy. We could either query and clear the interrupted status of the agent thread after the call to waitUntilClosed() in handleRequest()like so:

Solution 1: clear thread interrupt status in handleRequest()
  connection.waitUntilClosed(5000);
  if (Thread.interrupted() && tracing) {
    traceOut.println("Thread was interrupted - clearing interrupted status!");
  }

Notice, that the call to Thread.interrupted() queries and clears the interrupted status of a thread simultaneously. A slightly more elegant solution would probably be to query and clear the interrupted status already in the finally block of the waitUntilClosed() method of SocketConnection. In the case the thread was interrupted by the timer thread because of a timeout, the interrupted status of the thread should be cleared and an InterruptedException should be thrown. This exception will then be handled correctly in Task.handleRequest().

Solution 2: clear interrupt status and throw InterruptedException in waitUntilClosed()
  finally {
    timer.cancel(e);
    if (Thread.interrupted() {
      throw new InterruptedException();
    }
    ...
  }

With either of these two changes, the JCK tests will run more stable on machines with a high load an don't produce any spurious test failures because of InterruptedExceptions any more.

Notice that there's one last caveat: for some (to me yet unknown) reason, the JTHarness 3.2.2 executes the JCK tests in a folder in a different order than the harness that comes with the JCK test suit (and which is version 3.2_2). For example the tests in /api/java_lang/management/ThreadMXBean are executed in the following order (and succeed) with the original JCK harness:

api/java_lang/management/ThreadMXBean/index_TrdMBean_MB.jtr
api/java_lang/management/ThreadMXBean/index_TrdMBean.jtr
api/java_lang/management/ThreadMXBean/index_setTrdMBean_MB.jtr
api/java_lang/management/ThreadMXBean/index_setTrdMBean.jtr
api/java_lang/management/ThreadMXBean/index_infoTrdMBean_MB.jtr
api/java_lang/management/ThreadMXBean/index_infoTrdMBean.jtr

With the JTHarness 3.2.2, there is one test - namely TrdMBean - that will fail because the ThreadMXBean tests are executed in a different order:

api/java_lang/management/ThreadMXBean/index_infoTrdMBean.jtr
api/java_lang/management/ThreadMXBean/index_infoTrdMBean_MB.jtr
api/java_lang/management/ThreadMXBean/index_setTrdMBean.jtr
api/java_lang/management/ThreadMXBean/index_setTrdMBean_MB.jtr
api/java_lang/management/ThreadMXBean/index_TrdMBean.jtr
api/java_lang/management/ThreadMXBean/index_TrdMBean_MB.jtr

To avoid this problem, the test can be placed in the exclude file. This can of course only be done if the JCK run is not intended for certification! If anyone of the experts knows the reason why these tests get executed in a different order by the two versions of the test harness, please comment.

Although I'm referring to version 3.2.2 of JTHarness in this blog, the latest version 4.1.1 suffers from the same problem. I've therefore opened a bug report for this issue ("Bug in test agent causes random test failures (with InterruptedException)") which got the internal review ID 1109902. If you're interested in resolving this problem, you can vote for the bug once it appears in the bug database.

Update (2008-07-15): thanks to Brian Kurotsuchi this problem has been resolved in the latest version 4.1.4 of JTHarness (see jtharness issue 35).

Related Topics >>