BOMed out by Notepad and javac

Posted by cayhorstmann on April 10, 2012 at 9:50 PM PDT

I've been too busy to blog for quite some time, but today something happened that seemed strange enough to break my silence. A student came to me with a Java source file that the grading script rejected. We looked at it and couldn't figure out why. I unearthed the error message:

MergeSorter.java:1: error: illegal character: \65279
import java.util.Random;
^
1 error

Huh? What's \65279? Why the backslash? I didn't even know what notation
that was. I looked at the file with Emacs hexl-mode and saw that the first
three bytes were hex EF BB BF. In all these years, I had never seen that,
but Google set me straight. It's the Unicode byte order mark, or BOM. I
asked the student what editor he had used to produce this file. Sure
enough, it was Notepad. Of course. If I had the power to eradicate one
program from the face of the earth, it surely would be Notepad.
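
If you want to reproduce the problem without Notepad, here is a minimal
sketch (the class and file names are made up for illustration) that writes
a BOM-prefixed source file:

import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class MakeBomFile {
    public static void main(String[] args) throws Exception {
        // \uFEFF at the start turns into the bytes EF BB BF under UTF-8,
        // which is exactly what Notepad prepends.
        String source = "\uFEFF" + "public class Empty {}\n";
        Files.write(Paths.get("Empty.java"),
                source.getBytes(StandardCharsets.UTF_8));
    }
}

Compile the result with javac -encoding UTF-8 Empty.java and you should
get the same complaint about \65279.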


Just in case you haven't been down this particular rathole before,
here's a refresher on the BOM. At one point in time, Unicode fit into 16
bits, and it seemed attractive to encode it with fixed-width 16-bit
quantities. For example, an uppercase A is hexadecimal 0041, so you have
one byte of 00 and one byte of 41. Or do you? On a little-endian platform
such as Intel, it would be more convenient to have a byte of 41 followed
by a byte of 00. Rather than lamely settling on either little-endian or
big-endian encoding, Unicode gives you a much more interesting choice. Your
file can start out with the byte order mark, hexadecimal FEFF. If it shows
up as FE FF when reading a byte at a time, the data is big-endian, and if
it shows up as FF FE, it's little-endian.
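
If you ever have to make that guess yourself, it's a two-byte peek. Here
is a minimal sketch (the class and method names are mine, not from any
library):

import java.io.*;

public class BomSniffer {
    // Guess UTF-16 endianness from a leading BOM; null if there isn't one.
    static String sniffUtf16(InputStream in) throws IOException {
        int b1 = in.read();
        int b2 = in.read();
        if (b1 == 0xFE && b2 == 0xFF) return "UTF-16BE"; // big-endian
        if (b1 == 0xFF && b2 == 0xFE) return "UTF-16LE"; // little-endian
        return null; // no BOM; the two bytes consumed belong to the data
    }
}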


But UTF-16 is so last millennium. Now Unicode has grown to 20 bits.
While one could theoretically encode it fixed-length with 3-byte or 4-byte
values, just about everyone uses the more economical UTF-8 instead. That's
a variable-length encoding. 7-bit ASCII is embedded as 0bbbbbbb, where
each b is a bit. Then we have a bunch of two-byte codes of the form
110bbbbb 10bbbbbb, followed by three-byte codes 1110bbbb 10bbbbbb
10bbbbbb, and so on. EF BB BF happens to be the three-byte encoding of the
BOM. Work it out for yourself as an exercise! And, by the way, the decimal
value is 65279.
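
If you'd rather let the JDK do the exercise for you, this snippet prints
the bytes:

import java.nio.charset.StandardCharsets;

public class BomBytes {
    public static void main(String[] args) {
        // U+FEFF is 1111111011111111. Split it into 4 + 6 + 6 bits and
        // drop them into the 1110bbbb 10bbbbbb 10bbbbbb template:
        // 11101111 10111011 10111111, i.e. EF BB BF.
        for (byte b : "\uFEFF".getBytes(StandardCharsets.UTF_8))
            System.out.printf("%02X ", b); // prints: EF BB BF
    }
}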

But who needs a byte order mark for UTF-8? There are no two ways of
ordering the bytes. The first byte is always the one starting with
something other than 10, and the others always start with 10. Why would
Notepad put a BOM into a UTF-8 document? That's actual work. Usually,
Notepad is stupid, not evil. So I checked the Unicode spec here. They say it's
perfectly OK to add a BOM in front of a file, and it might actually be
useful because it allows a guess that this is a UTF-8 encoded file. If you
open the file, knowing that it is UTF-8, you should ignore it.
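
Ignoring it yourself takes a one-character peek. Here is a minimal sketch
of a reader that swallows a leading BOM (the helper method is my own
invention):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class SkipBom {
    // Returns a UTF-8 reader with any leading U+FEFF already discarded.
    static BufferedReader utf8Reader(InputStream in) throws IOException {
        PushbackReader pr = new PushbackReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        int first = pr.read();
        if (first != -1 && first != '\uFEFF') pr.unread(first);
        return new BufferedReader(pr);
    }
}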

That's fair. So Java, which, as we all know, loves Unicode, will surely
do the right thing: read the BOM and ignore it in a file that's opened
with UTF-8 encoding. Umm, no. Check out this and this bug report.
The folks at Sun wrung their hands and wailed that fixing this bug
would break a whole bunch of "customer" tools. Which turned out to be the
Sun app server.

Well, guess what. Not fixing the bug breaks javac, which
now rejects perfectly valid UTF-8 source files.

Why didn't I notice this earlier? I guess I have finally reached the
point where students configure Windows to use UTF-8 and not some archaic
Microsoft-specific 8-bit encoding. That's good. Now we just need
javac to read those UTF-8 files. If Notepad can, surely javac
can too.
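
Until that happens, a grading script can scrub the BOM before handing the
files to javac. A sketch, with the file handling kept deliberately simple:

import java.nio.file.*;
import java.util.Arrays;

public class StripBom {
    public static void main(String[] args) throws Exception {
        for (String name : args) {
            Path p = Paths.get(name);
            byte[] b = Files.readAllBytes(p);
            // EF BB BF at the start? Rewrite the file without it.
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                    && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
                Files.write(p, Arrays.copyOfRange(b, 3, b.length));
        }
    }
}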


Comments

Earlier this week, I was burned by BOMs when attempting StAX parsing on some XML. I'd somehow assumed they were for UTF-16 only...

While figuring that out, I discovered org.apache.commons.io.input.BOMInputStream which skips those bytes. Just in case that helps anyone.
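
For example, assuming commons-io is on the classpath, wrapping the stream
is all it takes:

import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.BOMInputStream;

public class BomSafeRead {
    public static void main(String[] args) throws IOException {
        // By default, BOMInputStream detects and swallows a UTF-8 BOM,
        // so whatever parses the stream downstream never sees U+FEFF.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new BOMInputStream(new FileInputStream(args[0])),
                StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}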

UTF-16 is a variable-length encoding, too. The characters in the BMP are encoded using just 2 bytes. Those in other planes, though, use a surrogate pair of two 2-byte units for a total of 4 bytes. The bytes appear either in big-endian (UTF-16BE) or little-endian (UTF-16LE) order.

Hence, UTF-32, UTF-16 and UTF-8 can all encode the entire Unicode repertoire. None of them is old-fashioned; all are equivalent and 3rd-millennium-ready. (Besides, it takes 21 bits, not 20, to encode the 17 planes of Unicode.)

A study of the java.lang.String API reveals that the internal encoding is UTF-16, with some methods accepting and returning int rather than char values to accommodate all Unicode characters.
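
For example, a character outside the BMP occupies two chars but is one
code point:

public class CodePoints {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00";  // U+1F600, a character outside the BMP
        System.out.println(s.length());                      // 2 chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.printf("U+%X%n", s.codePointAt(0));       // U+1F600
    }
}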

The bad choice of the Java designers was to limit char to 16 bits. At that time, the need to grow the Unicode repertoire beyond one plane was already in sight.

In the end, I wouldn't blame Notepad, a crappy app, I agree, but the ignorance of its users ;-)