Search |
||
Reading stdin may cause your JVM to hangPosted by kohsuke on September 28, 2009 at 6:12 PM PDT
Today I've digged deeper into one of the nastiest problems I ever encounted since I became a Java programmer. The problem was this — I have the Hudson slave agent program, which blocks on stdin for read almost all the time. This process uses other threads to perform other activities (what this thread does is actually driven by what it reads from stdin, but that's irrelevant to this bug.) On Windows, I've seen on several occasions that this process hangs. The interesting thing about this hang is that it doesn't react to jstack, and so I couldn't see the stack trace (whereas if this was caused by a usual Java level deadlock, say, incorrect use of synchronized, jstack would show you how the deadlock is occurring.) If I bring in a Windows native debugger, I can see a number of threads all suspending at Win32 system calls (KiFastSystemCallRet, to be exact), but I couldn't really make any sense out of it. You see, when you look at a Java process from a Windows debugger, mostly you just see what appears to be a JIT-generated code in the call stack, so it really doesn't tell you what's going on. In an attempt to better understand the problem, I start a JVM with a debug option, which magically makes the problem disappear. So for a longest time, my "workaround" for this problem was to run the JVM with a debugger support. A break-through happened, when I started seeing this in my unit tests, and so decided to revisit this issue. Instead of usual Visual Studio debugger, this time I was using WinDbg because I didn't wait to install Visual Studio on this virtual machine that I run Windows in. After attaching to this JVM, WinDbg failed to break the execution. But unlike Visual Studio, it reported this helpful message:
Break-in sent, waiting 30 seconds...
WARNING: Break-in timed out, suspending.
This is usually caused by another thread holding the loader lock
What's the "loader lock"? Does anyone hold it? With this info, I started looking at the stack trace more carefully, and indeed I found the thread that appears to be holding this lock: 12 Id: 814.8b8 Suspend: 1 Teb: 7ffa8000 Unfrozen ChildEBP RetAddr 02f1e750 7c90d99a ntdll!KiFastSystemCallRet 02f1e754 7c810f63 ntdll!NtQueryVolumeInformationFile+0xc 02f1e784 77c2c9f9 kernel32!GetFileType+0x7e 02f1e7e8 77c1f01d msvcrt!_ioinit+0x19f 02f1e88c 7c90118a msvcrt!__CRTDLL_INIT+0xac 02f1e8ac 7c91c4fa ntdll!LdrpCallInitRoutine+0x14 02f1e9b4 7c916371 ntdll!LdrpRunInitializeRoutines+0x344 02f1ec60 7c9164d3 ntdll!LdrpLoadDll+0x3e5 02f1ef08 7c801bbd ntdll!LdrLoadDll+0x230 02f1ef70 7c801d72 kernel32!LoadLibraryExW+0x18e 02f1ef84 7c801da8 kernel32!LoadLibraryExA+0x1f 02f1efa0 77de8830 kernel32!LoadLibraryA+0x94 02f1f05c 6d3eb1be ADVAPI32!CryptAcquireContextA+0x512 WARNING: Stack unwind information not available. Following frames may be wrong. 02f1f13c 6d99c844 java_6d3e0000!Java_sun_security_provider_NativeSeedGenerator_nativeGenerateSeed+0x6e 02f1f154 6d99cf9c jvm!AsyncGetCallTrace+0x3f374 02f1f1cc 6da241e1 jvm!AsyncGetCallTrace+0x3facc 02f1f210 6d99d01d jvm!JVM_FindSignal+0x53111 02f1f22c 6d98d64c jvm!AsyncGetCallTrace+0x3fb4d 02f1f2e8 6d98d88b jvm!AsyncGetCallTrace+0x3017c 02f1f300 6d98db52 jvm!AsyncGetCallTrace+0x303bb Someone is calling sun.security.provider.NativeSeedGenerator.nativeGenerateSeed(...), which results in the system loading a DLL — presumably some cryptography related DLL. The initialization of this includes _ioinit from Visual C runtime library (MSVCRT), which results in the GetFileType system call, and that's hanging. For those of you lucky enough who have never done Win32 programming, GetFileType is a Windows system call that tells you the kind of a file descriptor, given a file descriptor (why C runtime needs to know, I have no idea.) MSDN article on GetFileType doesn't say anything, but the source code of _ioinit has a comment that says GetFileType will hang if there are other pending read operations on the same file handle. Of course, I know of one such thread — the one that's reading stdin! I can see its stack trace through WinDbg: 10 Id: 814.880 Suspend: 1 Teb: 7ffaa000 Unfrozen ChildEBP RetAddr 02e7d7e4 7c90d9da ntdll!KiFastSystemCallRet 02e7d7e8 7c801879 ntdll!NtReadFile+0xc 02e7d850 6d3e4a78 kernel32!ReadFile+0x16c WARNING: Stack unwind information not available. Following frames may be wrong. 02e7d870 6d3e422d java_6d3e0000!handleRead+0x25 02e7f898 6d3e1dae java_6d3e0000!VerifyClassCodesForMajorVersion+0x16a 02e7f8b8 00979e27 java_6d3e0000!Java_java_io_FileInputStream_readBytes+0x1d 02e7f904 00972ec5 0x979e27 02e7f944 00972ec5 0x972ec5 02e7f97c 00972da1 0x972ec5 02e7f9bc 00972ec5 0x972da1 02e7f9ec 00972ec5 0x972ec5 02e7fb14 6d99c844 0x972ec5 02e7fb2c 6d99cf9c jvm!AsyncGetCallTrace+0x3f374 02e7fba4 6da241e1 jvm!AsyncGetCallTrace+0x3facc 02e7fbe8 6d99d167 jvm!JVM_FindSignal+0x53111 02e7fc24 6d99d1dd jvm!AsyncGetCallTrace+0x3fc97 02e7fcd0 6d9c6290 jvm!AsyncGetCallTrace+0x3fd0d 02e7fd24 6da80424 jvm!jio_fprintf+0x140 02e7fd50 6da23e5c jvm!JVM_FindSignal+0xaf354 02e7ff80 7c349565 jvm!JVM_FindSignal+0x52d8c And this makes it easy to understand why the use of the debug option eliminates the issue — it presumably changes the order of the DLL loading, in such a way that this race condition doesn't occur, at least with the crypto DLL. FixNow I understand the problem — if you have a thread that's doing a blocking read on stdin, DLL load will block. The reason this problem doesn't manifest more often is that as soon as you send something to its stdin, the deadlock will resolve itself. Also, you only tend to actually load more DLLs early in the life of a process (later LoadLibrary calls more likely just find what you've already loaded, rather than loading a brand-new DLL.) This explains why I only see this during the start up of a Hudson slave. So the challenge now is how to fix it, and this is the part that I still don't have a good answer, and this is a part of the reason why I'm writing this. I thought of 2 approaches, but neither of them work:
In the end, I resorted to a very poor approach; I call SecureRandom before I start reading stdin to eagerly load whatever crypto DLL that it needs to load. This apparently "fixes" the problem in this case, but clearly, it's not a real solution as it can happen with any DLL. If anyone has a better idea about how to fix this, please let me know. And hopefully Google indexes this blog so that others in the same boat won't have to waste as much time as I did. »
Comments
Comments are listed in date ascending order (oldest first)
Re: FileChannel are interruptible
Submitted by kohsuke on Tue, 2009-09-29 12:06.
Right, but I surely don't want to close the stream. If the read operation responds to Thread.interrupt(), maybe another approach is to use another thread to constantly interrupt the thread that does the read (but if I do that, now the problem is that I can't use interruption to signal the termination.)
System.in.available()
Submitted by skyewire on Wed, 2009-09-30 14:22.
This problem is in the Sun bug tracker (4809647). One workaround is to try using System.in.available() to do non-blocking IO: you only call System.in.read() when System.in.available() returns > 0. I've used this method with some success for doing non-blocking socket I/O (without NIO). However, I'm not sure how well it works for System.in because available() is known to be pretty unreliable, e.g. it may only return > 0 after a newline is written to stdin.
I imagine that a real fix can only be done at the JRE level by ensuring that it never blocks forever on the ReadFile call, maybe by using PeekNamedPipe, ReadFileEx, or some such Win32 nonsense :) Re: System.in.available()
Submitted by kohsuke on Wed, 2009-09-30 16:14.
Another problem in the System.in.available()-based approach is that it'll induce a delay when the data arrives, because the read thread has to basically do Thread.sleep() for a fixed amount of time.
Thanks for a reference to 3809647. I've updated that bug to include a reference to this investigation. The person who examine that bug in 2002 obviously have failed to get to the bottom of the problem.
Would you say this is a
Submitted by redsolo on Thu, 2009-10-01 06:01.
Would you say this is a problem in Windows or in the JVM? Is this also applicable to Windows application that does similar things?
It's a problem in Windows,
Submitted by kohsuke on Thu, 2009-10-01 10:02.
It's a problem in Windows, that application developers can work around, if we have a bit of help from JavaSE libraries. Namely, the ability to duplicate a file handle. I think it's a reasonable feature to ask.
This problem can be found in
Submitted by kohsuke on Thu, 2009-10-01 12:09.
This problem can be found in Microsoft bug database at https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?Fe... , and it appears to be fixed in Windows Vista and onward.
Console?
Submitted by malachid on Thu, 2009-10-08 01:40.
Out of curiosity, did this problem also happen if you accessed the System.in/err/out via Console?
Yes, because the problem is
Submitted by kohsuke on Thu, 2009-10-08 09:12.
Yes, because the problem is at Win32 level, and Console still uses FileInputStream underneath.
|
||
|
|
FileChannel are interruptible