


Kohsuke Kawaguchi

Kohsuke Kawaguchi's Blog

Â and nbsp mystery explained

Posted by kohsuke on January 11, 2008 at 07:10 PM | Comments (6)

"non-breaking space" character, which is known as Unicode code point 160 (written as U+00A0), AKA " " in HTML, is often used to force browsers to put whitespace. This is particularly so since "space" characters (U+0020) are normalized by them.

When a non-breaking space character is sent to the browser, it is first encoded into a sequence of bytes for transmission. If the server chooses UTF-8 for encoding, this character is converted into two bytes, "C2 A0" (for those who are curious, see the UTF-8 encoding rules for yourself).
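To see those two bytes for yourself, here is a minimal sketch in Java. The class name NbspEncode and the toHex helper are made up for this illustration; only the standard getBytes call does the real work.

```java
import java.nio.charset.StandardCharsets;

public class NbspEncode {
    // Hypothetical helper: render bytes as space-separated upper-case hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // the non-breaking space, code point 160

        // UTF-8 needs two bytes for any code point from U+0080 to U+07FF.
        System.out.println(toHex(nbsp.getBytes(StandardCharsets.UTF_8)));      // C2 A0

        // ISO-8859-1 maps the same code point to the single byte A0.
        System.out.println(toHex(nbsp.getBytes(StandardCharsets.ISO_8859_1))); // A0
    }
}
```

The same character, two different byte sequences: which one goes over the wire depends entirely on the encoding the server picked.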

Now, if the browser decodes this as UTF-8, everything is happy. But for various reasons it often fails to pick up the correct encoding and ends up using iso-8859-1 instead, as this is commonly the system default encoding, especially in the U.S.

When the byte sequence "C2 A0" is interpreted as iso-8859-1, this is decoded into two characters, "A circumflex" followed by "non-breaking space". That's why you see a strange "Â" (followed by space, which you can't see.)
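The mis-decoding is easy to reproduce outside a browser. The following is a small sketch, not a model of any particular browser; the class name NbspMisdecode and its two helper methods are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class NbspMisdecode {
    // The bytes a UTF-8 server sends for one non-breaking space.
    static final byte[] WIRE = { (byte) 0xC2, (byte) 0xA0 };

    // Hypothetical helper: what a client sees if it assumes iso-8859-1.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: what a client sees if it correctly assumes UTF-8.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wrong = decodeAsLatin1(WIRE);
        String right = decodeAsUtf8(WIRE);

        // iso-8859-1 turns every byte into one character, so we get two:
        // U+00C2 (A circumflex) followed by U+00A0 (non-breaking space).
        System.out.println(wrong.length());                      // 2
        System.out.println(Integer.toHexString(wrong.charAt(0))); // c2
        System.out.println(Integer.toHexString(wrong.charAt(1))); // a0

        // UTF-8 recombines the two bytes into the single original character.
        System.out.println(right.length());                      // 1
    }
}
```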

When this happens, what you need to find out is why the browser is choosing the incorrect encoding. It's hard to list possible causes exhaustively, but the typical ones are:

  • You wrote a static HTML file in UTF-8, but your web server doesn't know that, so it doesn't send the HTTP Content-Type header with the proper charset. Thus the browser ends up making a guess at the encoding, and it guesses wrong.
  • You wrote a web application, but it's not sending the charset in the Content-Type header. ServletResponse.setCharacterEncoding("UTF-8") is your friend.
  • Putting a UTF-8 BOM at the beginning of the HTML document helps the browser detect the right encoding.
  • An HTML meta tag can be used to reiterate the encoding, like "<META http-equiv="Content-Type" content="text/html;charset=UTF-8">"
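To make the role of the Content-Type header concrete, here is a deliberately simplified sketch of a client choosing a decoder. It is an assumption-laden toy: the class CharsetGuess and its methods are hypothetical, it looks only at the header, and it hard-codes iso-8859-1 as the fallback. Real browsers also consider the BOM, meta tags, and their own heuristics.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetGuess {
    // Hypothetical helper: use the charset parameter of a Content-Type
    // header value if present, otherwise fall back to iso-8859-1.
    static Charset charsetFor(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                                   .trim()
                                   .replace("\"", "");
                    return Charset.forName(name);
                }
            }
        }
        return StandardCharsets.ISO_8859_1; // assumed fallback for this sketch
    }

    // Decode a response body the way this toy client would.
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetFor(contentType));
    }

    public static void main(String[] args) {
        // "a", non-breaking space, "b", sent as UTF-8: 61 C2 A0 62
        byte[] body = "a\u00A0b".getBytes(StandardCharsets.UTF_8);

        // With the charset declared, the text survives intact.
        System.out.println(decode(body, "text/html; charset=UTF-8").length()); // 3

        // Without it, the fallback yields an extra A circumflex.
        System.out.println(decode(body, "text/html").length());               // 4
    }
}
```

The point of the sketch is only that the bytes are identical in both cases; what changes the outcome is whether the server told the client which encoding it used.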

Comments

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

    Posted by: felipegaucho on January 12, 2008 at 01:22 AM

  • http://jelmer.jteam.nl/2007/08/12/on-character-set-encodings/

    Posted by: jkuperus on January 12, 2008 at 12:56 PM

  • Thanks for the pointers. I think the one posted by felipegaucho is worth the read if people are not familiar with this area. It clearly distinguishes the code points and encodings, which is IMO the most significant distinction one needs to be aware of.

    Posted by: kohsuke on January 14, 2008 at 09:25 AM

  • kohsuke,

    Sometimes what seems most trivial can be momentous. It seems to me that if I'm intending my web server/browser to use UTF-8 and it is in fact using iso-8859-1, other problems that I can't easily identify will crop up also.

    Forcing UTF-8 should rectify those other issues!

    Thanks for the tip.

    Gary

    Posted by: gthomps on January 14, 2008 at 03:29 PM

  • And that's just reading a document... sending a request and interpreting the parameters is another source of headaches.

    I guess developers that live in countries with languages outside the ASCII realm get burnt very soon with that :)

    Posted by: greeneyed on January 15, 2008 at 05:48 AM

  • greeneyed — Yes, there are a lot of things to be covered when it comes to these issues. For query parameters, see elharo's comment on this post.

    Posted by: kohsuke on January 15, 2008 at 09:42 AM


