Another solution for non-UTF8 source files in NetBeans 6.1?
Posted by joconner on April 5, 2008 at 9:55 PM PDT
Recently I mentioned a potential problem when saving source files in a non-Unicode charset encoding. The potential data loss is significant for large projects. After thinking about the problem a little more, I have a potential solution, one that allows you to save to a non-Unicode encoding but also prevents data loss. You are familiar with Here's how it would work. First, NetBeans 6.1 uses UTF-8 for a project's default source code and configuration file encoding, an excellent choice by the way. So, now imagine that your source code has the Euro currency symbol in it. That's Unicode code point U+20AC.
Now, let's imagine that you need to change your project encoding for some reason. So, maybe you choose ISO-8859-1, which doesn't contain the Euro symbol. You can still represent the Euro character, but you'll have to encode it with the \u. Wouldn't it be nice if NetBeans did this for you, creating So, what do you think? Maybe the NetBeans team can get this into the 6.1 product before final release? Have you seen NetBeans 6.1? Give it a try, and blog about it. Who knows, you might win $500! Also posted to joconner.com.
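To make the idea concrete, here is a minimal sketch of the kind of conversion described above, not anything NetBeans actually ships. It assumes a hypothetical helper class `UnicodeEscaper` with a method `escapeUnencodable`; both names are invented for illustration. Characters the target charset can represent are left alone, and everything else becomes a backslash-u escape.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, invented for this example.
class UnicodeEscaper {

    // Keeps every code point the target charset can encode and rewrites
    // the rest as Java Unicode escapes (one escape per UTF-16 code unit).
    static String escapeUnencodable(String source, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); ) {
            int codePoint = source.codePointAt(i);
            int length = Character.charCount(codePoint);
            String piece = source.substring(i, i + length);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                for (char unit : piece.toCharArray()) {
                    out.append(String.format("\\u%04X", (int) unit));
                }
            }
            i += length;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Euro sign is written as an escape here so that this file
        // compiles the same way under any source encoding.
        String line = "String price = \"100 \u20AC\";";
        // Prints: String price = "100 \u20AC";
        System.out.println(escapeUnencodable(line, Charset.forName("ISO-8859-1")));
    }
}
```

Because the Java compiler translates Unicode escapes before it tokenizes the source, the escaped text should compile to the same program as the original, whether the character sits in a string literal, a comment, or an identifier.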
Comments
Comments are listed in date ascending order (oldest first)
Submitted by lmineiro on Wed, 2008-04-09 05:54.
This is a real-world issue. I experienced this exact problem with NetBeans on my team.
We're a Portuguese company and it's pretty common to have some strings (in our local language) in some projects that don't require internationalization. What happened was that a guy using Mac OS X with ISO-8859-1 encoding would type something like "explicação" and another guy using some flavour of Linux, with NB set to UTF-8 encoding, would get an error saving the file, complaining about the chars.
All of this has been said and I'm not bringing anything new to the discussion, but what we _had_ to do to minimize all these problems was to create a NetBeans plugin that everyone in the team installed, with the Ctrl+Shift+U shortcut bound to "Fix Unicode". What it does is replace all the "extended" non-ASCII chars with their Unicode representation (\uXXXX).
Since we started using this homemade plugin, everybody got used to the Ctrl+Shift+U thing and we never had any strings problem.
The caveats mentioned above never really happened in our real-world work, although I reckon they're valid problems - we never experienced any.
Cheers.
Submitted by goron on Sun, 2008-04-06 08:35.
I think you are thinking mostly of changing from utf-8 to something else (effectively us-ascii - fairly safe), but my experience shows that files frequently move back and forth between different encodings, e.g., I save as utf-8, a colleague in NYC opens it, saves in latin1, I open and change it back, etc.
The issue with source control systems (and indeed any other process you might have) is that utf-8 streams might challenge their view of 'text', so they perhaps treat the file as binary, and then you lose the ability to do diffs.
Oh, and whilst Hello\u20ac is valid to the compiler, the NetBeans editor complains about it - that is certainly an NB bug to be fixed.
My point, though, is that encoding issues have been around for a long time and this particular change in NB behaviour doesn't really alter things. If anything, the bug should be that NB doesn't do a good enough job of determining the actual encoding of the file (if it makes any attempt at all) and that it doesn't let you choose an encoding per file (the NB RCP framework should make that easy though).
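As a rough illustration of what "determining the actual encoding" could mean, here is a small sketch under stated assumptions. The class `EncodingGuesser` and its `guess` method are hypothetical names, and this is not how NetBeans works internally. It tries a strict UTF-8 decode and, if the bytes are malformed, falls back to a charset supplied by the caller.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, invented for this example.
class EncodingGuesser {

    // Returns UTF-8 when the bytes decode cleanly as UTF-8,
    // otherwise the fallback charset chosen by the caller.
    static Charset guess(byte[] bytes, Charset fallback) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // "explicação", written with escapes so the source encoding does not matter.
        String word = "explica\u00E7\u00E3o";
        byte[] savedAsLatin1 = word.getBytes(StandardCharsets.ISO_8859_1);
        byte[] savedAsUtf8 = word.getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(savedAsLatin1, StandardCharsets.ISO_8859_1)); // ISO-8859-1
        System.out.println(guess(savedAsUtf8, StandardCharsets.ISO_8859_1));   // UTF-8
    }
}
```

This is only a heuristic. Pure ASCII files pass the UTF-8 check trivially, and the check cannot tell one single-byte encoding from another, so a per-file or per-project setting is still needed for the fallback.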
I wonder if the Java compiler should support something like Vim modelines, so the top comment in the file could have something like: @encoding utf-8 ? OK, so you've got to read part of the file to get the encoding, but that hasn't hampered XML too much...
Submitted by goron on Sat, 2008-04-05 22:59.
NB61 has a project-level encoding, so do you want it to automatically check out all 5000 files, check that the encoding still holds, and if not, change the text and write it back?
I don't think it's practical - maybe a warning and an offer to do 'match text to new encoding' refactoring or something.
Also, what about the bigger problems of mixed IDE environments? Eg, people using IDEs that can (and do) use different encodings for different types, or arbitrarily on different files?
What about version control systems that might not like changing binary representations of what they thought was text? (ok, not a problem going from utf-8 to us-ascii, but what about the other way around?)
What about the (c) symbol - that also has problems with latin-1 vs utf-8 (utf8 is c2a9) - should your legal boilerplate at the top of each file have a \u00a9? or more practically, "(c)"?
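For reference, the byte-level difference for the copyright sign can be checked with a few lines of Java. This is just an illustrative sketch; the class name `CopyrightBytes` and the `hex` helper are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, invented for this example.
class CopyrightBytes {

    // Formats a byte array as lowercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9"; // the copyright sign, U+00A9
        System.out.println(hex(copyright.getBytes(StandardCharsets.ISO_8859_1))); // a9
        System.out.println(hex(copyright.getBytes(StandardCharsets.UTF_8)));      // c2a9
    }
}
```

A file saved as utf-8 and reopened as latin-1 therefore shows two characters (U+00C2 followed by U+00A9) where one was intended, which is the usual symptom of this mismatch.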
What about Java identifiers? Suppose someone has a class called Hello€ (it is legal) - how do you solve that? Or if you had variables called Ω or ß (again, both legal)?
I think this is a much wider, and quite old, problem that, everywhere I've worked, has been solved by policy ("heh, everyone use latin-1 coz clearcase is too stupid to understand utf-8 properly").
Submitted by joconner on Sun, 2008-04-06 00:09.
@goron, thanks for your comments. NB will save the file in the new encoding if a user changes that encoding...even if 5000 files are involved. If NB is going to rewrite all the files, it might as well do so nicely.
Mixed IDE environments shouldn't have a problem with \u-encoded files. Any editor can read those files, so I'm not sure what bigger problem you mean? Can you explain?
I'm not sure how version control systems would be more challenged by \u-encoded characters in files any more than any other change in the file. If you change the encoding in NB, NB will write out the file in the new encoding. A version control system will have to deal with that regardless of my suggestion.
If the target encoding can't represent U+00A9, I think writing \u00A9 to the file is at least as helpful as writing out '?'...at least the meaning of the data is maintained.
Indeed, Hello€ is a legal identifier. By the way, I maintained and updated the method Character.isJavaIdentifier()...but that's another story. And Hello\u20AC is legal too...and represents the same identifier. But changing this identifier to Hello? isn't a good thing...and neither is it legal.
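The identifier claims in this thread can be checked with the JDK's `Character.isJavaIdentifierStart` and `Character.isJavaIdentifierPart` methods. The sketch below wraps them in a hypothetical helper named `looksLikeIdentifier`, which is an invented name. It only tests the characters and does not reject reserved words such as `class`.

```java
// Hypothetical demo class, invented for this example.
class IdentifierCheck {

    // True when every code point is allowed at its position in a Java
    // identifier. Keywords are NOT excluded by this simplified check.
    static boolean looksLikeIdentifier(String name) {
        if (name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < name.length(); ) {
            int codePoint = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIdentifier("Hello\u20AC")); // true, Hello followed by the Euro sign
        System.out.println(looksLikeIdentifier("\u03A9"));      // true, Greek capital omega
        System.out.println(looksLikeIdentifier("\u00DF"));      // true, sharp s
        System.out.println(looksLikeIdentifier("Hello?"));      // false
    }
}
```

This matches the point made above: replacing an unencodable character with '?' turns a legal identifier into an illegal one, while a \u escape keeps it both legal and identical in meaning.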