Character Conversion Points

Posted by joconner on April 13, 2008 at 9:07 AM PDT

You'd think this sort of problem would be resolved by now, but it's not. It's still almost impossible to quickly and easily migrate an application from the all-too-common default Latin-1 (ISO-8859-1) character encoding to UTF-8. The problem isn't that UTF-8 can't handle the conversion. No, that's definitely not it. UTF-8 can represent any Latin-1 character and much, much more. The problem is that Latin-1 is so deeply ingrained as the default in so many software interfaces that you end up with many faulty conversion points. A conversion point is a handoff between one software component and another, a place where character encodings matter and where faulty conversions are far too common.
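To make the idea of a faulty conversion point concrete, here is a minimal sketch in Java. It assumes one component writes a name as UTF-8 bytes and the next component reads those same bytes as Latin-1. The class and method names are made up for illustration, and the strings use Unicode escapes so the source file's own encoding can't interfere.

```java
import java.nio.charset.StandardCharsets;

class ConversionPointDemo {
    // Simulates a faulty handoff: the sender encodes as UTF-8,
    // but the receiver decodes the same bytes as Latin-1.
    static String misreadAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes the mistake: recover the original bytes, then decode them as UTF-8.
    // This works because Latin-1 maps every byte value to exactly one character.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Jos\u00e9";                 // "Jos" + e with acute accent
        String garbled = misreadAsLatin1(name);    // "Jos" + U+00C3 + U+00A9
        System.out.println(name.length());         // 4
        System.out.println(garbled.length());      // 5
        System.out.println(repair(garbled).equals(name)); // true
    }
}
```

The one accented character becomes two unrelated characters because UTF-8 uses two bytes for it and Latin-1 treats each byte as its own character. The repair only works while the garbled text has not been stored or converted again somewhere that drops characters.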

Here's an example: a simple web application that stores names and addresses in a database. Chances are, if you haven't done anything explicit to change this, the web page itself will have no charset encoding associated with it. And neither will your application server. And neither will your database. And without explicit settings, many applications use Latin-1 as the default character set. So, you'll be able to enter, store, retrieve, and display common Western European names, but you won't be able to handle Russian or Japanese or Chinese or, well, you get the idea.
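Here is a rough sketch of why a Latin-1 default handles Western European names but not Russian or Japanese ones. It uses only the standard Java charset classes. The class name, method names, and sample names are hypothetical and exist only for this illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1Limits {
    // True when every character of the text has a Latin-1 representation.
    static boolean fitsInLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    // Stores and reloads the text the way a Latin-1-only component would.
    // String.getBytes replaces unmappable characters with '?'.
    static String roundTripThroughLatin1(String text) {
        byte[] stored = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String french = "Fran\u00e7ois";                 // c with cedilla
        String russian = "\u0418\u0432\u0430\u043d";     // Cyrillic, 4 letters
        String japanese = "\u5c71\u7530";                // Kanji, 2 characters

        System.out.println(fitsInLatin1(french));             // true
        System.out.println(fitsInLatin1(russian));            // false
        System.out.println(fitsInLatin1(japanese));           // false
        System.out.println(roundTripThroughLatin1(russian));  // ????
        System.out.println(roundTripThroughLatin1(japanese)); // ??
    }
}
```

Note that the loss is silent: no exception is thrown, and the question marks are what ends up stored.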

So let's imagine you decide to convert from Latin-1 to UTF-8 so that you open up your application to the rest of the world's languages and scripts. What does that mean? What must you do? How do you start?

Here are some of the charset conversion points you'll need to resolve as you work through this migration:

  1. database tables
  2. database connections
  3. application and/or web server frameworks
  4. web pages
  5. form encodings
  6. JavaScript or other browser scripts
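As one illustration from the list above, here is a small sketch of the form encoding conversion point. It assumes the browser percent-encodes a form field using the page's charset and the server decodes it with whatever charset it is configured for. It uses the standard java.net.URLEncoder and java.net.URLDecoder classes to stand in for both sides. The class and method names are invented for this example and are not taken from any framework.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingDemo {
    // Stands in for the browser: percent-encodes a field using the page charset.
    static String submit(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + pageCharset, e);
        }
    }

    // Stands in for the server: decodes the field using the server's charset.
    static String receive(String encoded, String serverCharset) {
        try {
            return URLDecoder.decode(encoded, serverCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + serverCharset, e);
        }
    }

    public static void main(String[] args) {
        String name = "Jos\u00e9";
        String onTheWire = submit(name, "UTF-8");
        System.out.println(onTheWire);                                        // Jos%C3%A9
        System.out.println(receive(onTheWire, "UTF-8").equals(name));         // true
        System.out.println(receive(onTheWire, "ISO-8859-1").equals(name));    // false
    }
}
```

The point of the sketch is that both sides of the handoff must agree on the charset. A page served as UTF-8 with a server that still decodes form data as Latin-1 produces garbled names even when the database itself is configured correctly.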

To help you get started, I've discussed the first four conversion points in the article Character Conversions from Browser to Database. Go ahead, take a look, but come back here to let me know what you think. It's an old article, for certain. However, I ran head-on into this very problem just this week. The same problems never go away, and the article had the content my team needed to resolve it in our environment.

I'll talk about some of the JavaScript issues in an upcoming blog post.

Also posted to joconner.com.

Comments

Please keep up these great articles; internationalization is a tough topic to find good content on. I am writing an EMR for physician offices and designed it to handle multiple languages and character sets. See the Japanese screenshot at the bottom of this page:

http://www.patientos.org/software/index.html

This is a Swing client (JBoss appserver), and the database holds all the language translations and UI display text.

Obviously there is more work to be done, and I have found some additional challenges. For example, on one screen the EditorPane displaying HTML was not displaying the Japanese characters.

@caultonpos, if you enjoy posts about i18n topics, you might also be interested in a more recent post, Encoding URIs and Their Components. Thanks for your comments!