
Another solution for non-UTF8 source files in NetBeans 6.1?

Posted by joconner on April 5, 2008 at 9:55 PM PDT

Recently I mentioned a potential problem when saving source files in a non-Unicode charset encoding. The potential data loss is significant for large projects. After thinking about the problem a little more, I have a potential solution, a solution that allows you to save to a non-Unicode encoding but also prevents data loss.

Are you familiar with the \u notation for non-ASCII characters in property files? I think the same encoding can work for non-ASCII characters in any Java source file. I'm not suggesting it should be the preferred representation; I think the \u notation is only tolerable, something to be avoided whenever possible. However, in this situation -- saving files that were once in UTF-8 -- it might be the only option for storing the files without data loss.
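For example, once the compiler processes Unicode escapes, these two spellings produce exactly the same string constant (the class and field names are just for illustration):

    // Saved as UTF-8: both fields hold the identical string. Only the second
    // spelling survives unchanged if the file is re-saved in an ASCII-only encoding.
    class EuroLiterals {
        static final String RAW     = "Preis in €";       // raw character U+20AC
        static final String ESCAPED = "Preis in \u20AC";  // same character, written as a Unicode escape
    }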

Here's how it would work. First, NetBeans 6.1 uses UTF-8 for a project's default source code and configuration file encoding, an excellent choice by the way. So, now imagine that your source code has the Euro currency symbol in it. That's Unicode code point U+20AC. And the character itself is this: €. If you can't see the actual character (maybe you don't have a capable font?), here's the image instead:

[Image: the Euro glyph]

Now, let's imagine that you need to change your project encoding for some reason. So, maybe you choose ISO-8859-1, which doesn't contain the Euro symbol. You can still represent the Euro character, but you'll have to encode it with a \u escape. Wouldn't it be nice if NetBeans did this for you, creating \u20AC in your file instead of converting the character to a meaningless ? question mark? I think that would be better. And it's entirely possible. It doesn't prevent NetBeans from converting the file to the target encoding as requested by the user, and it allows NetBeans to prevent data loss by using the \u encoding for characters that are not in the target charset.
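To make that concrete, here is a rough sketch of the substitution using only the standard CharsetEncoder API. It is not how NetBeans actually implements (or would implement) the feature, and the class and method names are mine:

    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    public class EscapingWriterSketch {

        // Returns the source text with every character the target charset cannot
        // represent replaced by its Unicode escape, instead of degrading it to '?'.
        static String escapeUnmappable(String source, Charset target) {
            CharsetEncoder encoder = target.newEncoder();
            StringBuilder out = new StringBuilder(source.length());
            for (int i = 0; i < source.length(); i++) {
                char c = source.charAt(i);
                if (encoder.canEncode(c)) {
                    out.append(c);
                } else {
                    out.append(String.format("\\u%04X", (int) c));
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            String line = "String price = \"100 €\";";
            // ISO-8859-1 has no Euro sign, so it comes out as its escape rather than '?'
            System.out.println(escapeUnmappable(line, Charset.forName("ISO-8859-1")));
        }
    }

The result can still be written out in the user's chosen encoding, but nothing in the file has lost its meaning.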

So, what do you think? Maybe the NetBeans team can get this into the 6.1 product before final release?

Have you seen NetBeans 6.1? Give it a try, and blog about it. Who knows, you might win $500!

Also posted to joconner.com.


Comments


This is an old thread, but it showed up in Google as one of the first results when I was having issues, again, with the dreaded "UTF-8 this file is not safe to open blah blah" from Netbeans.

I just discovered that my issue with this was related to Windows. I noticed that not all of the TPL or PHP files in my project were giving me this error in Netbeans. I found that it was only files that I had opened previously for a "quick and dirty" edit using Notepad. Little did I realize that every time I opened one of these files in Notepad and re-saved it, Notepad was taking it upon itself to open the UTF-8 file and, without saying anything, save it back in ANSI encoding. Now I know why so many developers prefer a Linux platform for development.

I have banged my head against the desk on several occasions when projects were being plagued with this issue in Netbeans. The fix is to simply reopen the file in question in Notepad, choose Save As and re-save the file, explicitly specifying UTF-8 as the encoding type, and voilà...no more complaints from Netbeans. This fix has been tested in Windows 7 Ultimate Edition. Your mileage may vary with other versions of Windows (and other versions of Notepad).

Just thought I would post this for anyone else that might be plagued by Netbeans UTF-8 errors who might be in the same boat I was.

@goron, thanks for your comments. NB will save the file in the new encoding if a user changes that encoding...even if 5000 files are involved. If NB is going to rewrite all the files, it might as well do so nicely.

Mixed IDE environments shouldn't have a problem with \u-encoded files. Any editor can read those files, so I'm not sure what bigger problem you mean? Can you explain?

I'm not sure how \u-encoded characters in files would challenge version control systems any more than any other change to the file. If you change the encoding in NB, NB will write out the file in the new encoding. A version control system will have to deal with that regardless of my suggestion.

If the target encoding can't represent U+00A9, I think writing \u00A9 to the file is at least as helpful as writing out '?'...at least the meaning of the data is maintained.

Indeed, Hello€ is a legal identifier. By the way, I maintained and updated the method Character.isJavaIdentifier()...but that's another story. And Hello\u20AC is legal too...and represents the same identifier. But changing this identifier to Hello? isn't a good thing...and neither is it legal.
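For anyone who wants to see it, here is a minimal UTF-8 source file (compiled with javac -encoding UTF-8) in which the two spellings resolve to the same class; the names are invented for the example:

    // The compiler translates \u20AC to € before parsing, so both spellings name one class.
    class Hello\u20AC {                       // declared with the escape: safe in any ASCII-compatible encoding
        static String currency() { return "EUR"; }
    }

    class IdentifierDemo {
        public static void main(String[] args) {
            System.out.println(Hello€.currency());   // referenced with the raw character: same class
        }
    }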

NB61 has a project-level encoding, so do you want it to automatically check out all 5000 files, check that the encoding still holds, and if not, change the text and write it back?

I don't think it's practical - maybe a warning and an offer to do 'match text to new encoding' refactoring or something.

Also, what about the bigger problems of mixed IDE environments? Eg, people using IDEs that can (and do) use different encodings for different file types, or arbitrarily on different files?

What about version control systems that might not like changing binary representations of what it thought was text? (ok not a problem going from utf-8 to us-ascii, but what about the other way around).

What about the (c) symbol - that also has problems with latin-1 vs utf-8 (utf8 is c2a9) - should your legal boilerplate at the top of each file have a \u00a9? or more practically, "(c)"?

What about Java identifiers? Suppose someone has a class called Hello€ (it is legal) - how do you solve that? Or if you had variables called Ω or ß (again, both legal)?

I think this is a much wider, and quite old problem that in everyplace I've worked, has been solved by policy ("heh, everyone use latin-1 coz clearcase is too stupid to understand utf-8 properly").

I think you are thinking mostly of changing from utf-8 to something else (effectively us-ascii - fairly safe), but my experience shows that files frequently move back and forth between different encodings, eg, I save as utf-8, a colleague in NYC opens it, saves in latin1, I open and change it back, etc.

The issue with source control systems (and indeed any other process you might have) is that UTF-8 streams might challenge their view of 'text', perhaps get treated as binary, and then you lose the ability to do diffs.

Oh, and whilst Hello\u20ac is valid to the compiler, the Netbeans editor complains about it - that is certainly an NB bug to be fixed.

My point, though, is that encoding issues have been around for a long time and this particular change in NB behaviour doesn't really alter things. If anything, the bug should be that NB doesn't do a good enough job of determining the actual encoding of the file (if it makes any attempt at all) and that it doesn't let you choose an encoding per file (the NB RCP framework should make that easy though).

I wonder if the Java compiler should support something like Vim modelines, so the top comment in the file could have something like @encoding utf-8? OK, so you've got to read part of the file to get the encoding, but that hasn't hampered XML too much...
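Purely as a thought experiment - nothing reads such a declaration today - it might look like this:

    // Hypothetical syntax only: neither javac nor NetBeans honors an in-source
    // encoding declaration; this just sketches the Vim-modeline / XML-prologue idea.
    /* @encoding: UTF-8 */
    class Prices {
        static final String LABEL = "Preis in €";  // safe because the declared encoding matches the file's
    }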

This is a real-world issue. I experienced this exact problem with NetBeans on my team.

We're a Portuguese company and it's pretty common to have some strings (in our local language) in projects that don't require internationalization. What happened was that a guy using Mac OS X with ISO-8859-1 encoding would type something like "explicação", and another guy using some flavour of Linux and NB with UTF-8 encoding would get an error saving the file, complaining about the chars.

All of this has been said and I'm not bringing anything new to the discussion, but what we _had_ to do to minimize all these problems was to create a NetBeans plugin that everyone on the team installed, with the Ctrl+Shift+U shortcut bound to a "Fix Unicode" action. What it does is replace all the "extended" non-ASCII chars with their Unicode representation (\uXXXX).
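Roughly speaking, the idea boils down to something like the sketch below - a simplification for illustration, not our actual plugin code:

    class FixUnicodeSketch {

        // Replace every character above US-ASCII with its Unicode escape.
        static String escapeNonAscii(String text) {
            StringBuilder out = new StringBuilder(text.length());
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c > 0x7F) {
                    out.append(String.format("\\u%04X", (int) c));  // e.g. ç becomes \u00E7
                } else {
                    out.append(c);
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(escapeNonAscii("explicação"));  // prints explica\u00E7\u00E3o
        }
    }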

Since we started using this homemade plugin, everybody got used to the Ctrl+Shift+U thing and we never had any string problems.

The caveats mentioned above never really came up in our real-world work, although I reckon they're valid problems - we just never experienced any of them.

Cheers.