
NetBeans 6.1, UTF-8 encoded source files, and a tale of corruption

Posted by joconner on March 30, 2008 at 2:59 AM PDT

I'm always happy when a company or product adopts Unicode as its charset. I think it makes perfect sense to do so. There are lots of good reasons why standardizing on Unicode is the right thing:

  • the data charset can represent all modern, useful -- and many not-so-useful -- scripts
  • charset consistency helps prevent data loss across application boundaries, including the boundaries between applications, business tiers, application servers, and databases
  • a single, multilingual charset simplifies the localization process

I was pleased to see that NetBeans 6.0 and the 6.1 beta use the UTF-8 encoding (a Unicode encoding) as the default for project configuration and source files. The following figure shows the default setting in the project's property sheet:

NB 6.1 project properties


This makes it much easier to edit non-ASCII, non-English source and property files. You can type text in any supported Unicode script right into Java source code; legitimate uses include comments and even localizable text in ListResourceBundle files. You can do that because NetBeans saves your files using the UTF-8 encoding by default, and it even records this setting in the project's Ant build file so that compilation uses the correct javac arguments. For example, NetBeans would use the following for a UTF-8 source file:

javac -encoding UTF-8 YourSource.java
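
To make that concrete, here is a minimal sketch of the ListResourceBundle usage mentioned above (the class name, keys, and strings are my own invented example). It works only because the file is saved and compiled as UTF-8:

import java.util.ListResourceBundle;

// Japanese translations stored directly in source; safe only because
// NetBeans saves this file as UTF-8 and compiles it with -encoding UTF-8.
public class Messages_ja extends ListResourceBundle {
    protected Object[][] getContents() {
        return new Object[][] {
            { "greeting", "こんにちは" },   // "hello"
            { "farewell", "さようなら" },  // "goodbye"
        };
    }
}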

Despite these potential benefits, NetBeans 6.1 still doesn't get this feature quite right, in my humble opinion. Why not? The biggest reason is simple: file corruption and permanent data loss. Ouch!

Let's take a simple "Hello, world!" example in Japanese. This is no trouble for NetBeans because of the UTF-8 encoding. The editor even displays it correctly, as shown here:
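
(The screenshot aside, the source file itself would be something like this minimal sketch; the class name and the exact Japanese string are my assumption.)

public class HelloWorld {
    public static void main(String[] args) {
        // "Hello, world!" in Japanese; the literal survives only
        // because the file is saved as UTF-8.
        System.out.println("こんにちは、世界!");
    }
}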


Unfortunately, the joy was short-lived when I discovered how easy it is to corrupt this data. Feel like experimenting with the charset encoding? Surely someone will. I suspected what would happen, so I didn't try this on any substantial code base...but someone will. I sure hope they use version control software.

Reopen that project property sheet and select another encoding, say ISO-8859-1 or windows-1252, since so many American and Western-European programmers use those by default. What happens? Apparently nothing. In this NB 6.1 beta, I see the same editor screen as before. NetBeans hasn't yet reloaded the file from disk, so I'll force the issue by closing and reopening the file. The result is here:


Some of you, the super-careful, nit-picky ones, will now argue with me: "But John, you haven't really lost anything yet. 8859-1 and CP 1252 don't have those characters, but the original byte values are still intact. You can get them back in this example." OK, I concede the point. But now I'll show you some serious data loss, no messing around this time. Instead of ISO-8859-1, pick US-ASCII as the target charset. Add a comment or another line of code to the file. Save. And there you have it, as shown here:
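
(The nit-pickers' half of that is easy to verify in code. Here's my own sketch, not NetBeans internals: reinterpreting UTF-8 bytes as ISO-8859-1 looks like garbage but loses nothing, because ISO-8859-1 maps every byte value.)

import java.io.UnsupportedEncodingException;

public class MojibakeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] utf8Bytes = "こんにちは".getBytes("UTF-8");
        // Misread the UTF-8 bytes as ISO-8859-1: gibberish on screen...
        String mojibake = new String(utf8Bytes, "ISO-8859-1");
        System.out.println(mojibake);  // prints mojibake, not Japanese
        // ...but every byte survived, so re-encoding as ISO-8859-1 and
        // decoding as UTF-8 recovers the original text exactly.
        byte[] recovered = mojibake.getBytes("ISO-8859-1");
        System.out.println(new String(recovered, "UTF-8"));  // こんにちは
    }
}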


Now that's just not good. Did NetBeans save the file correctly? Sure. However, NetBeans can do better than this. I would argue that if NetBeans knows that the target encoding cannot represent all the characters in the file, it should at least warn the user that the resulting file will contain garbage characters and that parts of the file will be lost--permanently, in many cases.
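
The permanent part is ordinary charset behavior, as a second sketch (again my own example) shows:

import java.io.UnsupportedEncodingException;

public class AsciiLossDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "こんにちは、世界!";  // "Hello, world!" in Japanese
        // getBytes() silently substitutes '?' for every character that
        // US-ASCII cannot represent -- exactly what a save-as-ASCII does.
        byte[] ascii = original.getBytes("US-ASCII");
        String reloaded = new String(ascii, "US-ASCII");
        System.out.println(reloaded);  // prints ????????!
        // The Japanese characters are gone for good; no charset chosen at
        // reload time can reconstruct them from '?' bytes.
    }
}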

So, just in case anyone over there in the NetBeans developer group can hear me...you have to fix this. Yes, I know it's a silly mistake to make, but NetBeans can help people avoid it. Just provide a warning dialog: "Saving this file in the target encoding will cause data loss because the target encoding does not support all characters in this file or project." Keep the encoding feature; just perfect it by helping users avoid this costly mistake. The fact is that most software developers still don't understand character sets and encodings, and this is an accident waiting to happen.
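
The check itself would be cheap. Here's a hedged sketch of the kind of guard I mean, built on CharsetEncoder.canEncode (the surrounding class and method names are my own invention, not NetBeans code):

import java.nio.charset.Charset;

public class EncodingGuard {
    // True if saving 'fileContents' in 'targetCharset' would silently
    // destroy characters -- the condition that should raise the dialog.
    static boolean wouldLoseData(String fileContents, String targetCharset) {
        return !Charset.forName(targetCharset).newEncoder().canEncode(fileContents);
    }

    public static void main(String[] args) {
        String source = "// こんにちは、世界!";
        if (wouldLoseData(source, "US-ASCII")) {
            System.out.println("Warning: the target encoding cannot "
                    + "represent all characters in this file.");
        }
    }
}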

On a personal note: I really love NetBeans. And I hope this blog qualifies me for the NetBeans 6.1 blogging contest! I could probably file this under the "suggestions on how to enhance NetBeans 6.1" category.


Comments

Yeah. Charset issues are ugly. What we've done on more recent projects is to simply mandate UTF-8 for Java source files (and some other file types). A pre-commit hook script (we use Subversion) verifies this and rejects any text files that do not appear to be valid UTF-8.

The method is a hack. Using Python: (1) load the file (as bytes), (2) decode to unicode on the assumption that it is UTF-8, (3) encode the unicode back as UTF-8, yielding bytes. If no exceptions are thrown and the byte string from (1) is identical to the byte string from (3), the file is accepted as UTF-8. This whole dance only works because, unlike other 8-bit encodings, not every possible sequence of bytes is valid UTF-8.
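
(The same dance ports directly to Java; this is my own sketch, not the commenter's actual hook script. The strict decoder is the important part, since Java's default is to silently replace malformed input rather than throw.)

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

public class Utf8Check {
    static boolean isValidUtf8(byte[] original) {
        Charset utf8 = Charset.forName("UTF-8");
        try {
            // Steps (1)-(2): decode strictly; REPORT makes malformed
            // input throw instead of becoming replacement characters.
            CharBuffer chars = utf8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(original));
            // Step (3): re-encode and require a byte-identical round trip.
            ByteBuffer reencoded = utf8.newEncoder().encode(chars);
            byte[] bytes = new byte[reencoded.remaining()];
            reencoded.get(bytes);
            return Arrays.equals(original, bytes);
        } catch (CharacterCodingException e) {
            return false;  // not valid UTF-8; reject the commit
        }
    }
}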

In practice, this works surprisingly well.

As to the problem at the front end: I don't think there's any getting around a basic understanding of character sets and Unicode's various encodings.

It's worse than that (but not really NetBeans' fault). Or maybe not important at all. If you have a mixed IDE environment, some using NB6, some NB5, some Vim, some Eclipse... perhaps some developers in Tokyo, some in Moscow, some in London... you will already have gone through some pain and sorted out your charset strategy. You'll probably even use a VCS that understands different charsets and converts between them. Or, more likely, you'll have put all localisable text in properties files and mandated plain ASCII or Latin-1 for source files. The point you highlight is valid, but probably only for very small teams. And for winning the blogging contest :-) good luck!

As the encoding is project-wide, it ought to check all the files in the project, not just those currently open. Then there are two cases: first, where you are just declaring the encoding of existing files; and second, where you want to convert all files to a new encoding.
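
For the second case, the conversion itself is simple enough to sketch (the charsets, directory, and file extension here are my assumptions):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ProjectReencode {
    public static void main(String[] args) throws IOException {
        final Charset from = Charset.forName("ISO-8859-1");
        final Charset to = Charset.forName("UTF-8");
        // Rewrite every .java file under src/ in the new charset.
        Files.walk(Paths.get("src"))
             .filter(p -> p.toString().endsWith(".java"))
             .forEach(p -> {
                 try {
                     String text = new String(Files.readAllBytes(p), from);
                     Files.write(p, text.getBytes(to));
                 } catch (IOException e) {
                     System.err.println("Could not convert " + p + ": " + e);
                 }
             });
    }
}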

I have a similar problem. I originally set my project charset to tis-620 and later decided to change it to utf-8. It turns out all the files changed charset to iso-8859-1 and will not change back to anything else. New files, old files. The project properties still show the character encoding as utf-8, but the page properties show 8859-1, and it will not save as anything else. So now I've switched to using Dreamweaver. Any idea how I can fix this?