Ben Galbraith

Ben Galbraith's Blog

XML, Readers, and Streams: A Cautionary Tale

Posted by javaben on September 03, 2005 at 10:47 PM | Comments (5)

(Note: this entry is cross-posted on my personal blog site -- galbraiths.org/blog.)

If a system's glitches can be compared to fish, I want to tell you about my white whale.

A while back, I was working on a system feature that read in some XML from the filesystem, XSLT'd it into HTML, and served it up to a browser. The XML had a bunch of characters from the higher Unicode ranges (i.e., >255), and wouldn't you know, when viewed in a browser, these characters showed up as garbled data. Not "The Box"--that ugly little placeholder used when a font doesn't contain a character for a given code point--but usually one to three seemingly random characters that had nothing to do with the character that was supposed to be displayed.

Classic encoding problem.

For the uninitiated in character encodings, let me fill you in real quick. Disks store bytes, not characters. A byte is a numeric value between 0 and 255. To store characters on disks, a convention is used to map the numeric values of bytes to characters. In the early days of computing, we kept things simple and said that there could be no more than 256 different types of characters stored in files. Lately, we've taken to storing over 60,000 different types of characters. How do we represent that many values with just a byte?

Actually, that depends. An exceedingly large number of different conventions exist for mapping more than 256 characters to bytes. What all of these systems have in common is that multiple bytes are used to represent a single character. Two bytes used together can represent 65,536 unique characters; with three bytes, bump that up to over 16 million.
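
To make that concrete, here's a rough sketch that counts how many bytes a few characters take up under two common encodings. The class and method names are made up for illustration; the only real API used is String.getBytes with the JDK's StandardCharsets constants.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Latin capital A, e with acute accent, euro sign
        String[] samples = { "A", "\u00e9", "\u20ac" };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
                    + ": UTF-8 = " + utf8Length(s) + " byte(s), UTF-16 = "
                    + utf16Length(s) + " byte(s)");
        }
    }
}
```

The same character costs a different number of bytes depending on the convention, which is exactly why the reader of a file has to know which convention the writer used.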

And therein lies the rub. Files don't indicate the encoding used within them. Indeed, there's no guarantee that the files store character values at all. The user must know what to expect within the file, and if it's character data, they must know what encoding was used to store it.

Back to the story. I knew it was an encoding glitch; multiple characters showing up in place of one is a classic symptom (because multiple bytes represented the character, but the parser treated each byte as a unique character). I immediately assumed that the browser or the servlet (or the web framework on top of it) was to blame. I spent a lot of time educating myself on how encodings work over the web. I threw hours at the problem here and there and came up empty-handed each time.
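
The symptom is easy to reproduce in isolation. This is a minimal sketch, not the code from my system: it encodes a character as UTF-8 and then decodes those bytes as ISO-8859-1, which I'm using here as a stand-in for whatever single-byte encoding the wrong decoder happens to assume. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Mojibake {
    // Encode as UTF-8, then decode wrongly as ISO-8859-1,
    // which turns every byte into its own character.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String[] samples = { "\u00e9", "\u20ac" }; // e with acute accent, euro sign
        for (String original : samples) {
            String garbled = misdecode(original);
            System.out.println(original.length() + " character in, "
                    + garbled.length() + " characters out");
        }
    }
}
```

One character goes in and two or three unrelated characters come out, one per byte of the original UTF-8 sequence.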

And then, whilst reading through some of the backend code, I saw this innocuous little line:

Document document = new SAXBuilder().build(new FileReader(file));

See the problem? Look again. Notice the FileReader? I'm such an idiot. Here's the deal. XML files can contain any of thousands of different Unicode characters and can use a bunch of different encodings to map those to bytes. The encoding used on a particular XML document is indicated in the prolog, such as:

<?xml version="1.1" encoding="UTF-8"?>

I don't really use XML 1.1; I just put that in to piss off Elliotte. ;-) Note the encoding. Now, back to our FileReader. Readers in Java are nice because they handle converting bytes into characters automatically. But in order to do that, they have to know what encoding was used on the bytes they are being handed. If you don't specify an encoding, a Reader like FileReader will fall back on the platform's default encoding.

Ahhh, and there's our problem. PCs, Macs, *nix, they all use different encoding schemes by default, and they ain't UTF-8 (actually, on some *nixes it might be, I dunno). My XML files were UTF-8 encoded. So when I used a FileReader to read my XML file, the bytes were decoded with the wrong encoding before the XML parser ever saw them, and many of my characters came out garbled.
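
Here's a small sketch of what "the Reader has to know the encoding" means in practice. It decodes the same UTF-8 bytes through an InputStreamReader with different charsets and prints the platform default so you can see what a no-argument Reader would pick on your machine. The class name and the in-memory bytes are assumptions for illustration; ISO-8859-1 again stands in for a non-UTF-8 default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderCharsets {
    // Drain a Reader that decodes the given bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder out = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        System.out.println("Platform default charset: " + Charset.defaultCharset());
        System.out.println("Correct when decoded as UTF-8: "
                + decode(utf8Bytes, StandardCharsets.UTF_8).equals(original));
        System.out.println("Correct when decoded as ISO-8859-1: "
                + decode(utf8Bytes, StandardCharsets.ISO_8859_1).equals(original));
        System.out.println("Correct when decoded with the platform default: "
                + decode(utf8Bytes, Charset.defaultCharset()).equals(original));
    }
}
```

The last line of output depends entirely on the machine the code runs on, which is the whole problem.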

This is the code I should have written:

Document document = new SAXBuilder().build(new FileInputStream(file));

If you hand an XML parser bytes, which is the currency of InputStreams, the parser handles converting those bytes to characters itself, and uses the encoding in the XML prolog to configure itself for that process. If you hand it characters... it's stuck using those characters and can't affect the decoding process one whit, since it occurs a level beneath it.
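
Here's a sketch of the difference. To keep it runnable without extra libraries, it uses the JDK's built-in DocumentBuilder instead of JDOM's SAXBuilder, and an in-memory byte array instead of a file; the class and method names are made up. The Reader is deliberately given ISO-8859-1 to play the part of a platform default that isn't UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlDecodingDemo {
    // A UTF-8 encoded document whose prolog says so.
    static final byte[] XML_BYTES =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>caf\u00e9</word>"
                    .getBytes(StandardCharsets.UTF_8);

    // Hand the parser raw bytes: it decodes them itself, guided by the prolog.
    static String parseFromBytes() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(XML_BYTES));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Hand the parser characters that were already decoded with the wrong
    // charset: the prolog can no longer influence the decoding.
    static String parseFromMisdecodedReader() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(XML_BYTES), StandardCharsets.ISO_8859_1);
            Document doc = builder.parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("From bytes:  " + parseFromBytes().length() + " characters");
        System.out.println("From reader: " + parseFromMisdecodedReader().length() + " characters");
    }
}
```

Parsing from bytes gives back the four characters of the original word; parsing from the mis-decoded Reader gives five, because the two UTF-8 bytes of the accented letter have each become a separate character. Note that the parser doesn't complain in either case.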

It turns out this is a rather insidious bug. Because most encodings agree on the characters mapped to byte values 0-127 (since the ASCII standard was so pervasive), and because those are by far the most common characters for most folks here in the United States, you can go a long way with character encoding bugs like this and never know any different. But the day you add a higher value character... weird things happen.

Learn from me. Spare yourself the pain of wrestling with this one yourself. Make me feel my time was well spent. Never, ever use a Reader to parse in an XML file. There's already a great system for letting the parser handle the decoding; let it.


Comments

  • It's a very common error to pick up the machine's default character encoding/locale/time. There are cases of the JRE itself doing it. Turks must hate Java.

    Posted by: tackline on September 04, 2005 at 05:39 PM

  • There can be a bit of danger storing XML in Strings, too.

    About the most "fun" I ever had was packing XML data in a MIME email-message in .NET :-)

    Posted by: tobega on September 05, 2005 at 02:05 AM

  • I've already suggested on forums that FileReader should be deprecated. Exactly for this reason.

    Posted by: podlesh on September 06, 2005 at 07:31 AM

  • "....Turks must hate Java."
    It is not specific to Java. The same kind of problems exist in anything encoding-related for Turkish (OS, .Net, you name it), especially the infamous "dotless i". There are specific locale controls just for that character in JDK code. Even though .Net came much later than Java, they managed to screw up Turkish encoding anyway, though they claim it is fixed in the never-being-released 2.0.

    http://msdn.microsoft.com/netframework/default.aspx?pull=/library/en-us/dndotnet/html/StringsinNET20.asp

    Posted by: ahmetaa on September 06, 2005 at 10:08 AM

  • podlesh: Wow dude, that's awfully draconian. Why not just remove/deprecate methods on XML parsers that take Readers? Ah, that's problematic too, because once you have a pipeline process, you'll only want to convert from bytes to characters once, instead of continuously (and pointlessly) converting from bytes to characters to bytes as the XML streams through the pipeline.

    Perhaps the best solution is to force APIs or implementations to check if a FileReader is passed, and warn the user or throw an exception.

    Posted by: javaben on September 07, 2005 at 12:20 PM




