Â and nbsp mystery explained

Posted by kohsuke on January 11, 2008 at 7:10 PM PST

The "non-breaking space" character, which is Unicode code point 160 (written as U+00A0), AKA "&nbsp;" in HTML, is often used to force browsers to render whitespace. This is particularly so since runs of ordinary "space" characters (U+0020) are normalized (collapsed) by them.

When a non-breaking space character is sent to the browser, it is first encoded into a sequence of bytes for transmission. If the server chooses UTF-8 for encoding, this character is converted into two bytes, "C2 A0" (for those who are curious, check the UTF-8 encoding rules for yourself).
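If you want to check the byte sequence yourself, here is a minimal Java sketch. The class name NbspEncodeDemo and the helper utf8Hex are made up for illustration; the only real API used is String.getBytes with StandardCharsets.UTF_8.

```java
import java.nio.charset.StandardCharsets;

class NbspEncodeDemo {
    // Hypothetical helper: encode a string as UTF-8 and render the
    // resulting bytes as space-separated uppercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00A0")); // non-breaking space: prints "C2 A0"
        System.out.println(utf8Hex(" "));      // ordinary space: prints "20"
    }
}
```

The ordinary space stays a single byte because U+0020 is in the ASCII range, while U+00A0 is above 0x7F and therefore needs two bytes in UTF-8.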

Now, if a browser decodes this with UTF-8, everything is fine. But for various reasons it often fails to pick up the correct encoding and ends up using iso-8859-1 instead, as this is commonly the system default encoding, especially in the U.S.

When the byte sequence "C2 A0" is interpreted as iso-8859-1, each byte is decoded as its own character: C2 becomes "A circumflex" and A0 becomes "non-breaking space". That's why you see a strange "Â" (followed by a space, which you can't see).
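The mis-decoding is easy to reproduce. The following is a small sketch, again with made-up names (MojibakeDemo, decodeUtf8, decodeLatin1), that decodes the same two bytes under both charsets:

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decode the bytes the way the server intended.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode the bytes the way a browser stuck on iso-8859-1 would.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] wire = {(byte) 0xC2, (byte) 0xA0};

        String right = decodeUtf8(wire);
        String wrong = decodeLatin1(wire);

        System.out.println(right.length()); // prints 1 (just U+00A0)
        System.out.println(wrong.length()); // prints 2 (U+00C2 then U+00A0)
        for (char c : wrong.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // prints U+00C2, then U+00A0
        }
    }
}
```

U+00C2 is "Â", so the wrongly decoded string is exactly the "Â" plus invisible space described above.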

When this happens, what you need to find out is why the browser is choosing the incorrect encoding. It's hard to list possible causes exhaustively, but the typical ones are:

  • You wrote a static HTML file in UTF-8, but your web server doesn't know that, so it doesn't send the HTTP Content-Type header with the proper charset. The browser then has to guess the encoding, and it guesses wrong.
  • You wrote a web application, but it's not sending the Content-Type header with a charset. ServletResponse.setCharacterEncoding("UTF-8") is your friend.
  • On the fixing side, putting a UTF-8 BOM at the beginning of the HTML document helps the browser detect the right encoding.
  • An HTML meta tag can also be used to reiterate the encoding, like "<META http-equiv="Content-Type" content="text/html;charset=UTF-8">"
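To make the header and meta tag advice concrete, here is a rough sketch of a server that declares its encoding both ways. In a servlet you would call ServletResponse.setCharacterEncoding("UTF-8") as mentioned above; this sketch instead uses the JDK's built-in com.sun.net.httpserver package so that it runs without a servlet container. The class name Utf8PageServer, the page content, and port 8080 are all assumptions for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

class Utf8PageServer {
    // The header value that tells the browser how to decode the body.
    static final String CONTENT_TYPE = "text/html; charset=UTF-8";

    // Illustrative page: repeats the charset in a meta tag and
    // contains one non-breaking space between "a" and "b".
    static final String PAGE =
        "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\">"
        + "</head><body>a\u00A0b</body></html>";

    // Encode the page with the same charset the header announces.
    static byte[] pageBytes() {
        return PAGE.getBytes(StandardCharsets.UTF_8);
    }

    // Start a server on the given local port (0 picks any free port).
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/", exchange -> {
            byte[] body = pageBytes();
            exchange.getResponseHeaders().set("Content-Type", CONTENT_TYPE);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(8080); // assumed port; change if it is taken
        System.out.println("Serving on http://127.0.0.1:" + server.getAddress().getPort() + "/");
    }
}
```

The important part is that the charset named in the header and the charset used to produce the bytes are the same; if they disagree, you are back to the "Â" problem.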

Comments

And that's just reading a document... sending a request and interpreting the parameters is another source of headaches.

I guess developers that live in countries with languages outside the ASCII realm get burnt very soon with that :)

kohsuke, Sometimes what seems most trivial can be momentous. It seems to me that if I'm intending my web server/browser to use UTF-8 and it is in fact using iso-8859-1, other problems that I can't easily identify will crop up also. Forcing UTF-8 should rectify those other issues! Thanks for the tip. Gary

Thanks for the pointers. I think the one posted by felipegaucho is worth the read if people are not familiar with this area. It clearly distinguishes the code points and encodings, which is IMO the most significant distinction one needs to be aware of.

greeneyed -- Yes, there are a lot of things to be covered when it comes to these issues. For query parameters, see elharo's comment on this post.

http://jelmer.jteam.nl/2007/08/12/on-character-set-encodings/