Â and nbsp mystery explained

Posted by kohsuke on January 11, 2008 at 7:10 PM PST

The "non-breaking space" character, which is Unicode code point 160 (written as U+00A0), AKA "&nbsp;" in HTML, is often used to force browsers to render whitespace. This is particularly so since runs of ordinary "space" characters (U+0020) are normalized (collapsed) by them.

When a non-breaking space character is sent to the browser, it is first encoded into a sequence of bytes for transmission. If the server chooses UTF-8 for encoding, this character is converted into two bytes, "C2 A0" (for those who are curious, work through the UTF-8 encoding rule for yourself).
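If you'd rather not do the bit twiddling by hand, here is a minimal Python sketch of that encoding step. It is only an illustration, not what any particular server does internally; the helper name `encode_two_byte` is made up for this example and only handles the two-byte range of UTF-8.

```python
# Sketch: encode U+00A0 as UTF-8, once via the standard codec and once by hand.
NBSP = "\u00a0"  # non-breaking space, code point 160

# What a server effectively does when it picks UTF-8 for the response.
utf8_bytes = NBSP.encode("utf-8")


def encode_two_byte(code_point: int) -> bytes:
    """Hypothetical helper: UTF-8 rule for U+0080..U+07FF (110xxxxx 10xxxxxx)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    lead = 0xC0 | (code_point >> 6)            # marker 110 + top 5 bits
    continuation = 0x80 | (code_point & 0x3F)  # marker 10 + low 6 bits
    return bytes([lead, continuation])


manual_bytes = encode_two_byte(ord(NBSP))
hex_form = " ".join(f"{b:02X}" for b in utf8_bytes)

print(hex_form)                    # C2 A0
print(manual_bytes == utf8_bytes)  # True
```

The hand-rolled version and the built-in codec agree, which is a quick way to convince yourself where "C2 A0" comes from.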

Now, if the browser decodes this as UTF-8, everybody is happy. But for various reasons it sometimes fails to pick up the correct encoding and falls back to iso-8859-1, which is commonly the system default encoding, especially in the U.S.

When the byte sequence "C2 A0" is interpreted as iso-8859-1, it is decoded into two characters: "A circumflex" (U+00C2) followed by "non-breaking space" (U+00A0). That's why you see a strange "Â" on the page.
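The mismatch is easy to reproduce outside a browser. The following Python sketch decodes the same two bytes both ways; the variable names and the `describe` helper are just for illustration, and a real browser's encoding detection is of course more involved than a single `decode` call.

```python
import unicodedata

# The two bytes that UTF-8 produces for a non-breaking space.
wire_bytes = b"\xc2\xa0"

# Decoded with the encoding the server actually used.
as_utf8 = wire_bytes.decode("utf-8")

# Decoded with the wrong guess: every byte becomes its own character.
as_latin1 = wire_bytes.decode("iso-8859-1")


def describe(text: str) -> list[str]:
    """Illustrative helper: list each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(as_utf8))
# ['U+00A0 NO-BREAK SPACE']
print(describe(as_latin1))
# ['U+00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX', 'U+00A0 NO-BREAK SPACE']
```

One character in, two characters out: the extra one is the stray "A circumflex".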

And that's just reading a document... sending a request and interpreting the parameters is another source of headaches.

I guess developers that live in countries with languages outside the ASCII realm get burnt very soon with that :)

kohsuke, sometimes what seems most trivial can be momentous. It seems to me that if I intend my web server/browser to use UTF-8 and it is in fact using iso-8859-1, other problems that I can't easily identify will crop up too. Forcing UTF-8 should rectify those other issues! Thanks for the tip. Gary

Thanks for the pointers. I think the one posted by felipegaucho is worth the read if people are not familiar with this area. It clearly distinguishes the code points and encodings, which is IMO the most significant distinction one needs to be aware of.

greeneyed -- Yes, there are a lot of things to be covered when it comes to these issues. For query parameters, see elharo's comment on this post.