Skip to main content

Reading stdin may cause your JVM to hang

Posted by kohsuke on September 28, 2009 at 6:12 PM PDT

Today I've digged deeper into one of the nastiest problems I ever encounted since I became a Java programmer.

The problem was this — I have the Hudson slave agent program, which blocks on stdin for read almost all the time. This process uses other threads to perform other activities (what this thread does is actually driven by what it reads from stdin, but that's irrelevant to this bug.)

On Windows, I've seen on several occasions that this process hangs. The interesting thing about this hang is that it doesn't react to jstack, and so I couldn't see the stack trace (whereas if this was caused by a usual Java level deadlock, say, incorrect use of synchronized, jstack would show you how the deadlock is occurring.)

If I bring in a Windows native debugger, I can see a number of threads all suspending at Win32 system calls (KiFastSystemCallRet, to be exact), but I couldn't really make any sense out of it. You see, when you look at a Java process from a Windows debugger, mostly you just see what appears to be a JIT-generated code in the call stack, so it really doesn't tell you what's going on.

In an attempt to better understand the problem, I start a JVM with a debug option, which magically makes the problem disappear. So for a longest time, my "workaround" for this problem was to run the JVM with a debugger support.

A break-through happened, when I started seeing this in my unit tests, and so decided to revisit this issue. Instead of usual Visual Studio debugger, this time I was using WinDbg because I didn't wait to install Visual Studio on this virtual machine that I run Windows in.

After attaching to this JVM, WinDbg failed to break the execution. But unlike Visual Studio, it reported this helpful message:

Break-in sent, waiting 30 seconds...
WARNING: Break-in timed out, suspending.
         This is usually caused by another thread holding the loader lock

What's the "loader lock"? Does anyone hold it? With this info, I started looking at the stack trace more carefully, and indeed I found the thread that appears to be holding this lock:

  12  Id: 814.8b8 Suspend: 1 Teb: 7ffa8000 Unfrozen
ChildEBP RetAddr 
02f1e750 7c90d99a ntdll!KiFastSystemCallRet
02f1e754 7c810f63 ntdll!NtQueryVolumeInformationFile+0xc
02f1e784 77c2c9f9 kernel32!GetFileType+0x7e
02f1e7e8 77c1f01d msvcrt!_ioinit+0x19f
02f1e88c 7c90118a msvcrt!__CRTDLL_INIT+0xac
02f1e8ac 7c91c4fa ntdll!LdrpCallInitRoutine+0x14
02f1e9b4 7c916371 ntdll!LdrpRunInitializeRoutines+0x344
02f1ec60 7c9164d3 ntdll!LdrpLoadDll+0x3e5
02f1ef08 7c801bbd ntdll!LdrLoadDll+0x230
02f1ef70 7c801d72 kernel32!LoadLibraryExW+0x18e
02f1ef84 7c801da8 kernel32!LoadLibraryExA+0x1f
02f1efa0 77de8830 kernel32!LoadLibraryA+0x94
02f1f05c 6d3eb1be ADVAPI32!CryptAcquireContextA+0x512
WARNING: Stack unwind information not available. Following frames may be wrong.
02f1f13c 6d99c844 java_6d3e0000!Java_sun_security_provider_NativeSeedGenerator_nativeGenerateSeed+0x6e
02f1f154 6d99cf9c jvm!AsyncGetCallTrace+0x3f374
02f1f1cc 6da241e1 jvm!AsyncGetCallTrace+0x3facc
02f1f210 6d99d01d jvm!JVM_FindSignal+0x53111
02f1f22c 6d98d64c jvm!AsyncGetCallTrace+0x3fb4d
02f1f2e8 6d98d88b jvm!AsyncGetCallTrace+0x3017c
02f1f300 6d98db52 jvm!AsyncGetCallTrace+0x303bb

Someone is calling sun.security.provider.NativeSeedGenerator.nativeGenerateSeed(...), which results in the system loading a DLL — presumably some cryptography related DLL. The initialization of this includes _ioinit from Visual C runtime library (MSVCRT), which results in the GetFileType system call, and that's hanging.

For those of you lucky enough who have never done Win32 programming, GetFileType is a Windows system call that tells you the kind of a file descriptor, given a file descriptor (why C runtime needs to know, I have no idea.) MSDN article on GetFileType doesn't say anything, but the source code of _ioinit has a comment that says GetFileType will hang if there are other pending read operations on the same file handle.

Of course, I know of one such thread — the one that's reading stdin! I can see its stack trace through WinDbg:

  10  Id: 814.880 Suspend: 1 Teb: 7ffaa000 Unfrozen
ChildEBP RetAddr 
02e7d7e4 7c90d9da ntdll!KiFastSystemCallRet
02e7d7e8 7c801879 ntdll!NtReadFile+0xc
02e7d850 6d3e4a78 kernel32!ReadFile+0x16c
WARNING: Stack unwind information not available. Following frames may be wrong.
02e7d870 6d3e422d java_6d3e0000!handleRead+0x25
02e7f898 6d3e1dae java_6d3e0000!VerifyClassCodesForMajorVersion+0x16a
02e7f8b8 00979e27 java_6d3e0000!Java_java_io_FileInputStream_readBytes+0x1d
02e7f904 00972ec5 0x979e27
02e7f944 00972ec5 0x972ec5
02e7f97c 00972da1 0x972ec5
02e7f9bc 00972ec5 0x972da1
02e7f9ec 00972ec5 0x972ec5
02e7fb14 6d99c844 0x972ec5
02e7fb2c 6d99cf9c jvm!AsyncGetCallTrace+0x3f374
02e7fba4 6da241e1 jvm!AsyncGetCallTrace+0x3facc
02e7fbe8 6d99d167 jvm!JVM_FindSignal+0x53111
02e7fc24 6d99d1dd jvm!AsyncGetCallTrace+0x3fc97
02e7fcd0 6d9c6290 jvm!AsyncGetCallTrace+0x3fd0d
02e7fd24 6da80424 jvm!jio_fprintf+0x140
02e7fd50 6da23e5c jvm!JVM_FindSignal+0xaf354
02e7ff80 7c349565 jvm!JVM_FindSignal+0x52d8c

And this makes it easy to understand why the use of the debug option eliminates the issue — it presumably changes the order of the DLL loading, in such a way that this race condition doesn't occur, at least with the crypto DLL.

Fix

Now I understand the problem — if you have a thread that's doing a blocking read on stdin, DLL load will block. The reason this problem doesn't manifest more often is that as soon as you send something to its stdin, the deadlock will resolve itself. Also, you only tend to actually load more DLLs early in the life of a process (later LoadLibrary calls more likely just find what you've already loaded, rather than loading a brand-new DLL.)
This explains why I only see this during the start up of a Hudson slave.

So the challenge now is how to fix it, and this is the part that I still don't have a good answer, and this is a part of the reason why I'm writing this. I thought of 2 approaches, but neither of them work:

Duplicating the handle into a higher value

Upon reading _ioinit source code, I noticed that GetFileType call only happens for stdin,stdout, and stderr. So if I could DuplicateHandle stdin to a higher file descriptor number, I can work around this problem rather nicely. Unfortunately, Java doesn't appear to have any API to duplicate a file handle. So in short of using JNI or JNA, I can't duplicate a handle, and that's a dependency I couldn't afford in this part of Hudson.

Read stdin with timeout

If I can read from stdin with timeout and instead do so in a loop, I can let other threads proceed with GetFileType, thus breaking the deadlock. Unfortunately, FileChannel is not SelectableChannel, so even with NIO, I really can't seem to issue a read with timeout.

In the end, I resorted to a very poor approach; I call SecureRandom before I start reading stdin to eagerly load whatever crypto DLL that it needs to load. This apparently "fixes" the problem in this case, but clearly, it's not a real solution as it can happen with any DLL.

If anyone has a better idea about how to fix this, please let me know. And hopefully Google indexes this blog so that others in the same boat won't have to waste as much time as I did.

Related Topics >>

Comments

I have question regarding stack trace you have displayed ...

I have question regarding stack trace you have displayed here.. How did you get java_* modules to show up in the trace ? e.g.
java_6d3e0000!<
java_6d3e0000!handleRead+0x25
02e7f898 6d3e1dae java_6d3e0000!VerifyClassCodesForMajorVersion+0x16a
02e7f8b8 00979e27 java_6d3e0000!Java_java_io_FileInputStream_readBytes+0x1d

I have a crash dump file for tomcat server running on Windows machine. I need to find out which java method was invoked just before tread crashed. I am using WinDbg tool and got latest symbols from Windows website. But still most of the java symbols are defaulting to jvm.dll keeping statements like -
248 Id: a58.106c Suspend: 0 Teb: 000007ff`ffdc4000 Unfrozen
Child-SP RetAddr Call Site
00000000`117be548 000007fe`fd8c10ac ntdll!ZwWaitForSingleObject+0xa
00000000`117be550 00000000`6dba9c6f KERNELBASE!WaitForSingleObjectEx+0x79
00000000`117be5f0 00000000`6dc56635 jvm!JVM_FindSignal+0x13869f
00000000`117be620 00000000`6da56c36 jvm!JVM_FindSignal+0x1e5065
00000000`117be6d0 00000000`022d8595 jvm!JVM_MonitorWait+0x1d6
00000000`117be750 00000000`00000000 0x22d8595

Any help to decode 'jvm!JVM_FindSignal+' to specific java method would be helpful !
Thanks

Console?

Out of curiosity, did this problem also happen if you accessed the System.in/err/out via Console?

Yes, because the problem is

Yes, because the problem is at Win32 level, and Console still uses FileInputStream underneath.

This problem can be found in

This problem can be found in Microsoft bug database at https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?Fe... , and it appears to be fixed in Windows Vista and onward.

Would you say this is a

Would you say this is a problem in Windows or in the JVM? Is this also applicable to Windows application that does similar things?

It's a problem in Windows,

It's a problem in Windows, that application developers can work around, if we have a bit of help from JavaSE libraries. Namely, the ability to duplicate a file handle. I think it's a reasonable feature to ask.

System.in.available()

This problem is in the Sun bug tracker (4809647). One workaround is to try using System.in.available() to do non-blocking IO: you only call System.in.read() when System.in.available() returns > 0. I've used this method with some success for doing non-blocking socket I/O (without NIO). However, I'm not sure how well it works for System.in because available() is known to be pretty unreliable, e.g. it may only return > 0 after a newline is written to stdin.

I imagine that a real fix can only be done at the JRE level by ensuring that it never blocks forever on the ReadFile call, maybe by using PeekNamedPipe, ReadFileEx, or some such Win32 nonsense :)

Re: System.in.available()

Another problem in the System.in.available()-based approach is that it'll induce a delay when the data arrives, because the read thread has to basically do Thread.sleep() for a fixed amount of time.

Thanks for a reference to 3809647. I've updated that bug to include a reference to this investigation. The person who examine that bug in 2002 obviously have failed to get to the bottom of the problem.

FileChannel are interruptible

FileChannels are interruptible through http://java.sun.com/javase/6/docs/api/java/nio/channels/InterruptibleCha... : "If a thread is blocked in an I/O operation on an interruptible channel then another thread may invoke the channel's close method." It looks like a quite brute force method but perhaps it works. Interesting article and nice investigation. Johannes

Re: FileChannel are interruptible

Right, but I surely don't want to close the stream. If the read operation responds to Thread.interrupt(), maybe another approach is to use another thread to constantly interrupt the thread that does the read (but if I do that, now the problem is that I can't use interruption to signal the termination.)