


Kohsuke Kawaguchi

Kohsuke Kawaguchi's Blog

XML processing pitfall: InputStream

Posted by kohsuke on October 07, 2005 at 12:43 PM | Comments (6)

Many XML parser APIs accept an InputStream or a Reader. For example, the JAXB Unmarshaller has unmarshal(InputStream), StAX has XMLInputFactory.createXMLStreamReader(InputStream), and XStream has XStream.fromXML(Reader). So all too often you'd write something like:
XMLInputFactory xif = ...;
xif.createXMLStreamReader(new FileInputStream("data/foo.xml"));
Or maybe:
XMLInputFactory xif = ...;
xif.createXMLStreamReader(getClass().getResourceAsStream("data.xml"));
The problem with this shows itself when you have references to other files in your XML file, such as:
<?xml version='1.0' ?>
<!DOCTYPE root [
  <!ENTITY ent SYSTEM "x/ent">
]>
<root> ... </root>
Or maybe:
<root>
  <x:include href="another-file.xml" />
</root>
In general, this doesn't work if your XML file has relative references to other resources, because the parser (or the unmarshaller or whatever) doesn't know the base URI against which to resolve a relative reference.
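
To make the missing piece concrete, here is a minimal sketch of the computation a parser has to perform for a relative reference. The document location file:/data/docs/foo.xml is a made-up example, and the class and method names are only illustrative.

```java
import java.net.URI;

class BaseUriResolution {
    /** Resolves a relative reference against the URI of the document that contains it. */
    static URI resolve(URI documentUri, String relativeReference) {
        return documentUri.resolve(relativeReference);
    }

    public static void main(String[] args) {
        // Hypothetical location of the document declaring <!ENTITY ent SYSTEM "x/ent">.
        URI documentUri = URI.create("file:/data/docs/foo.xml");

        // With a base URI the reference becomes an absolute location.
        System.out.println(resolve(documentUri, "x/ent")); // prints file:/data/docs/x/ent

        // A bare InputStream carries no URI, so the relative reference is all the parser has.
        System.out.println(URI.create("x/ent").isAbsolute()); // prints false
    }
}
```

When only a stream is handed over, the first argument simply doesn't exist, which is why the parser has to either guess or give up.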

To complicate matters further, some parsers, such as Xerces (at least some versions of it), try to resolve the reference against the current directory, which sometimes works (and breaks as soon as you deploy your app in production!). Other parsers, such as Aelfred, do a better job by issuing a warning in this situation.

Another factor that makes the situation worse is poorly designed APIs. For example, XStream doesn't offer any version of the fromXML method that lets you pass the URI of the document. So it's not just error-prone; it's actually impossible to make it resolve relative references correctly.

StAX is marginally better, as it offers XMLInputFactory.createXMLStreamReader(String,InputStream), which lets you pass in the URI. But unless you are an XML geek, it would probably never occur to you that you need to use this version rather than the simpler createXMLStreamReader(InputStream). Besides, you need to turn the file name into a URL, so the code will look like:

File file = new File("data/foo.xml");
xif.createXMLStreamReader( file.toURL().toExternalForm(), new FileInputStream(file) );
... which isn't exactly the simplest code in the world.
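
One way to keep that verbosity in a single place is a small helper. The following is only a sketch: the class and method names are hypothetical, and it builds the system ID with file.toURI().toASCIIString() rather than toURL(), following the advice given in the comments below the post.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

class StaxWithSystemId {
    /** Hypothetical helper: parses a file with its URI as the system ID and returns the root element name. */
    static String rootElementName(XMLInputFactory xif, File file) throws Exception {
        // The system ID gives the parser a base for relative references.
        String systemId = file.toURI().toASCIIString();
        try (InputStream in = new FileInputStream(file)) {
            XMLStreamReader reader = xif.createXMLStreamReader(systemId, in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamReader.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
                return null; // no element found
            } finally {
                reader.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        System.out.println(rootElementName(xif, new File(args[0])));
    }
}
```

The caller passes a File and never sees the system ID, which is roughly the shape of API the post argues for.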

The SAX API does a much better job, since you'd naturally use the XMLReader.parse(String) version, which is both the intuitive version and the correct one. The only small downside is that it's not type-safe, so at first glance you aren't sure whether you need to pass the URL form or the file name form (both actually work in most parsers; I don't know if that's required by the SAX API). JAXB does slightly better, as it exposes Unmarshaller.unmarshal(File), thereby eliminating the type-safety issue.
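
As a rough illustration of the SAX route, the sketch below writes a small document with a relative external entity into a temporary directory and parses it through XMLReader.parse(String). The class, method, and file names are made up for the example, and it assumes the parser's default configuration still loads external entities (hardened setups often disable this).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxSystemIdDemo {
    /** Creates foo.xml plus the entity file x/ent under dir and returns the path of foo.xml. */
    static Path writeSample(Path dir) throws Exception {
        Files.createDirectories(dir.resolve("x"));
        Files.write(dir.resolve("x").resolve("ent"), "hello".getBytes(StandardCharsets.UTF_8));
        Path doc = dir.resolve("foo.xml");
        String xml = "<?xml version='1.0'?>\n"
                + "<!DOCTYPE root [ <!ENTITY ent SYSTEM \"x/ent\"> ]>\n"
                + "<root>&ent;</root>";
        Files.write(doc, xml.getBytes(StandardCharsets.UTF_8));
        return doc;
    }

    /** Parses the document identified by systemId and returns its character data. */
    static String textContent(String systemId) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // The parser opens the document itself, so it knows the base URI for "x/ent".
        reader.parse(systemId);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        Path doc = writeSample(Files.createTempDirectory("sax-demo"));
        System.out.println(textContent(doc.toUri().toASCIIString()));
    }
}
```

Because the parser is given a location instead of a stream, the entity reference is resolved relative to foo.xml and not to whatever the current directory happens to be.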

The other benefit of an API that just asks you for the name of the XML file is that the implementation can choose the right buffering strategy without any redundancy. If an API accepts an InputStream, some implementations want you to do the buffering, while others do the buffering on their own. So you have this little guessing game of whether you should wrap your InputStream in a BufferedInputStream or not.
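
JAXP's StreamSource, which is not one of the APIs discussed above, is one example of this file-based style and is used here purely as an illustration. In this sketch the path data/foo.xml is hypothetical and the helper name is made up.

```java
import java.io.File;
import javax.xml.transform.stream.StreamSource;

class FileBasedSource {
    /** Builds a source from a file; the caller opens no stream and makes no buffering decision. */
    static StreamSource sourceFor(File file) {
        return new StreamSource(file);
    }

    public static void main(String[] args) {
        // Hypothetical path; the file does not need to exist to construct the source.
        StreamSource source = sourceFor(new File("data/foo.xml"));

        // The system ID is derived from the file, so relative references have a base.
        System.out.println(source.getSystemId());

        // No stream was supplied by the caller, so this prints true.
        System.out.println(source.getInputStream() == null);
    }
}
```

Whoever consumes the source opens the file itself and can decide how to buffer it.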

It's just one example of why it's hard to design an API that "just works."


Comments
Comments are listed in date ascending order (oldest first)

  • I don't agree.

    parse(InputStream) is the correct method. It is the most flexible and general method. It's just completely wrong to assume parsers will always be parsing a File on the disk. There are many cases where you're parsing a memory buffer (from parsing an HTTP request to dynamic XML). If you want to set the base URI or any other attribute that modifies the parser behavior, you should invoke a setter, e.g. setBaseUri. If the parser encounters an xinclude and no base URI is set, it should print a warning and use the current working directory. Aelfred and XStream have it right.

    Posted by: ocean on October 07, 2005 at 07:23 PM

  • Oh, I'm sure it's a correct method when you are parsing from byte[] or a socket or things of that sort. Sure. I'm not saying that you need to get rid of it. In fact all the APIs that I mentioned do let you parse from InputStream.


    There are two points that I was trying to make. One is that there's a potentially surprising effect in using the InputStream version (namely, resource references might not work). The other is that if you are parsing from a file, the API should make it easy for developers to do the right thing. Requiring them to invoke setBaseUri just doesn't cut it, IMO.

    Posted by: kohsuke on October 07, 2005 at 09:32 PM

  • In SAX API there is an EntityResolver that can have custom lookup logic.

    Posted by: euxx on October 08, 2005 at 12:00 AM

  • Yes, EntityResolver is useful for certain things. But for it to work correctly, I think the first resource that gets parsed needs to have a URI associated with it.

    Posted by: kohsuke on October 08, 2005 at 12:04 AM

  • It's worse than you thought. file.toURL().toExternalForm() doesn't always work. The URL object that toURL() returns does not correctly implement the URL specification for many characters likely to occur in file names, including the simple space (0x20). This can trip up relative URL resolution too.

    In Java 1.4 and later you should use file.toURI().toASCIIString() instead. In Java 1.3 and earlier, you'll need to build the URL from the file name yourself because Java's implementation is just too broken.

    Posted by: elharo on October 08, 2005 at 05:37 AM

  • Ahh, good to know. Now that you mention it, I remember dealing with whitespace in file names and URLs, but I didn't know that using file.toURI().toASCIIString() would have done it correctly.

    Posted by: kohsuke on October 08, 2005 at 10:16 AM
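
The difference elharo describes can be seen with a small sketch. The file name "my data/foo.xml" is a made-up example chosen because it contains a space, and the class and method names are only illustrative.

```java
import java.io.File;
import java.net.MalformedURLException;

class FileToUriDemo {
    /** The approach from the post; File.toURL() does not escape illegal URL characters. */
    @SuppressWarnings("deprecation") // used here only to show the problem
    static String viaToUrl(File file) throws MalformedURLException {
        return file.toURL().toExternalForm();
    }

    /** The approach suggested in the comments; the URI is properly escaped. */
    static String viaToUri(File file) {
        return file.toURI().toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        File file = new File("my data/foo.xml"); // hypothetical name containing a space
        System.out.println(viaToUrl(file)); // the space is left as-is
        System.out.println(viaToUri(file)); // the space is written as %20
    }
}
```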




