John O'Conner

John O'Conner's Blog

Charset Pitfalls in JSP/Servlet Containers

Posted by joconner on July 27, 2005 at 01:13 PM | Comments (8)

The J2SE platform has come a long way in internationalization. Some things are just easy...like entering your name in a Swing text field regardless of whether your name is John, José, or 田中 (Tanaka). Unicode prevails within the Java core. Unfortunately, entering non-ASCII text in the J2EE world isn't nearly as easy.

I've been playing around with various web servers recently, paying special attention to how browsers communicate non-ASCII text via GET and POST HTTP commands. While it appears that current browsers take hints from the page or form encoding and send form data back to the server in the same encoding, web servers remain blissfully unaware. They typically assume that the request encoding is ISO-8859-1. So, if my application url-encodes a GET parameter in UTF-8 (a Unicode encoding), the backend server (let's say Tomcat 5.5.9) assumes 8859-1. The result, of course, is that text data becomes mangled almost immediately as it travels through the various tiers of even a simple web-based application.

Here's a simple example JSP page that says Hello, <your name>!:

<%@page pageEncoding="iso-8859-1" contentType="text/html; charset=UTF-8" %>

<html>
    <head>
        <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
        <title>Say Hello!</title>
    </head>
    <body>

        <%
        String name = request.getParameter("NAME");
        if (name == null || name.length() == 0) {
            name = "World";
        }
        %>
        Hello, <%= name %>! <br>
    
        <form action="sayhello.jsp" method='GET'>
            <label for='NAME'>Name</label><input type="text" id="NAME" name="NAME"/>
            <button type="submit">Submit</button> 
        </form>
    
    </body>
</html>

Type in "John" and press the submit button. The result is that a URL like this is created:

http://localhost/sayhello.jsp?NAME=John
No problem there. The call to request.getParameter("NAME") retrieves the simple ASCII text without a hitch. Subsequently, the name is output to the HTML stream back to the browser where the expected greeting appears:
Hello, John!

Now type in "José" and submit. The GET URL will look like this:

http://localhost/sayhello.jsp?NAME=Jos%C3%A9
The %C3%A9 is the url-encoded UTF-8 representation of the é in José. Again, no problem here. The browser is taking its cue from the contentType setting where UTF-8 is specified as the charset. However, the output to the browser is this:
Hello, JosÃ©!
That's not right! What happened?

The web server (again Tomcat 5.5.9 in this case) assumes 8859-1 for the URL encoding as it reads the NAME parameter Jos%C3%A9. From its perspective, the %C3 entity represents the code point 0xC3 (Ã) in charset ISO-8859-1. The %A9 is 0xA9 (©) in ISO-8859-1. Hmmph...but I went through all the trouble to explicitly set the content type both in the JSP tag and in the HTML META tag.

The trouble is that none of this charset information gets sent back to the web server during a GET or POST operation. The server has no way of knowing how to interpret the url-encoded GET parameters, so it assumes ISO-8859-1.
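To see the mangling in isolation, here is a minimal sketch that uses only java.net.URLEncoder and java.net.URLDecoder. The class and method names are made up for illustration, and it only approximates what a container does when it decodes a query string with the wrong charset:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class QueryDecodingDemo {
    // Decodes a percent-encoded query value using the named charset.
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // What the browser sends for "José" when the page charset is UTF-8.
        String sent = URLEncoder.encode("Jos\u00E9", "UTF-8");
        System.out.println(sent);                       // Jos%C3%A9

        // Server assumes ISO-8859-1: each byte becomes its own character.
        System.out.println(decode(sent, "ISO-8859-1")); // J, o, s, U+00C3, U+00A9

        // Server told to use UTF-8: the two bytes are read as one character.
        System.out.println(decode(sent, "UTF-8"));      // J, o, s, U+00E9
    }
}
```

How the last two lines look on screen depends on the console's own encoding, which is why the comments list the characters rather than the rendered text.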

OK, so here's a small oversight in the HTTP or HTML spec...I haven't thought about it enough yet to decide. Regardless, this really affects multilingual communication via HTTP. JSP/Servlet containers and web servers are effectively broken in this area because of it. How should it be resolved?

Fortunately, some servers do try to address this. After searching online help sources for several hours, I found out that I can specify URIEncoding="UTF-8" in Tomcat's connector settings within the server.xml file. Now my Tomcat server reads the URL GET parameters correctly...sending out "Hello, José!" or "Hello, 田中!" as expected. However, there's still a problem.
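(For reference, the connector setting looks roughly like the fragment below. URIEncoding is the attribute in question; the port value is only a placeholder, and a real Connector element will carry whatever other attributes it already has.)

```xml
<!-- server.xml: keep the existing Connector attributes and add URIEncoding -->
<Connector port="8080" URIEncoding="UTF-8" />
```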

What if I want to POST some non-ASCII data, presumably to enter into a backend database? All is well since I set that URIEncoding flag, right? Wrong. It turns out that Tomcat (sorry to pick on this particular server), doesn't use this URIEncoding flag for POSTed form data. So, what does it use? ISO-8859-1 of course! So now, I'm back to where I started, and my imaginary application still greets Mr. ç”°ä¸ instead of Mr. 田中. Not good.

Now how do I get around this? Maybe I can set a hidden FORM parameter to the correct charset, read this, reset the request's character encoding via request.setCharacterEncoding(), and be done with it. I searched the online world again...sorry, nothing in the Tomcat docs on this, although I did see several requests that a parameter similar to URIEncoding be created to handle POSTed data. That would be nice. I got around my particular problem by explicitly calling request.setCharacterEncoding("UTF-8") in a control servlet. I passed in the encoding preference via a servlet initialization parameter POST_ENCODING. That's ok, I suppose.
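A minimal sketch of such a control servlet might look like the following. Only POST_ENCODING and request.setCharacterEncoding() come from the description above; the class name ControlServlet, the UTF-8 fallback, and the forward to /sayhello.jsp are hypothetical choices for illustration. It needs the Servlet API (javax.servlet) on the classpath, so it will not compile with the JDK alone.

```java
import java.io.IOException;
import javax.servlet.ServletConfig;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical control servlet: sets the request encoding, then hands off to the JSP.
public class ControlServlet extends HttpServlet {

    // Assumed fallback when POST_ENCODING is not configured.
    private String postEncoding = "UTF-8";

    @Override
    public void init(ServletConfig config) throws ServletException {
        super.init(config);
        String configured = config.getInitParameter("POST_ENCODING");
        if (configured != null && configured.length() > 0) {
            postEncoding = configured;
        }
    }

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Must happen before the first getParameter() call; once the
        // parameters have been parsed, changing the encoding has no effect.
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(postEncoding);
        }
        // Hypothetical target page; it can now read NAME correctly decoded.
        request.getRequestDispatcher("/sayhello.jsp").forward(request, response);
    }
}
```

POST_ENCODING would be declared as an init-param for this servlet in web.xml. The null check leaves the encoding alone when the client actually sent a charset in its Content-Type header.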

I think it would be easier, though, if there were a more visible standard on this for all JSP/Servlet containers, HTTP servers, or application servers. In the JSP/Servlet container area, Tomcat's URIEncoding goes a long way, at least for GET requests. Unfortunately, this isn't a J2EE standard setting in a web.xml file or anything, or at least it's not obvious to me so far. To make matters worse, each server platform (Tomcat, WebLogic, others) tries to handle this in its own way, creating proprietary solutions all around. I noticed that WebLogic uses entries in its weblogic.xml deployment descriptor to handle the same problem. A standard solution for all containers would be best, I think.

The blog server here at java.net seems to handle UTF-8 just fine. That is, the server knows to expect POST data encoded as UTF-8. Does the weblogs.java.net server simply call some method setting the request handler to use UTF-8? Does it read this preference from a properties file? a descriptor file? a command line argument when starting the server? Hmm...anyone at java.net willing to share how you handle this problem?


Comments
Comments are listed in date ascending order (oldest first)

  • Well, we wrestle with this kind of problem every day here in Asia. If you add user-defined characters to the mix, things get even worse!!

    Posted by: walterc on July 27, 2005 at 09:01 PM

  • I'm not at all surprised that some server side programs and browsers are getting this wrong. However in 2005 the specs are not at all ambiguous. (They used to be, which is part of the problem).

    GET uses URLs. URLs encode non-ASCII characters in UTF-8. Period. Never anything else. In the past this wasn't true, but there's no longer any room for debate on this. RFC 3986 (the URI spec) and RFC 3987 (the IRI spec) are crystal clear on this point. The debate is over. UTF-8 won.

    POST is a little more complex but not really any more ambiguous. Each POST request has an HTTP header. This header should have a Content-type field that specifies the actual character encoding of the body. If this field is missing or does not specify the character set, then the default character set is, I think, ISO-8859-1. Off the top of my head I'm not sure about that last bit. It might be US-ASCII. But again the specs are unambiguous. I just don't happen to remember the details right now.

    In 2005 there's no excuse for any software getting these character set details wrong. The specs are clear if developers would simply take the time to read and understand them.

    Posted by: elharo on July 28, 2005 at 08:41 AM

  • You're right...it's right there in the URI spec (pg 20):

    Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters.


    So, it appears that even browsers are doing this wrong at this point...regardless of the page encoding, browsers should use UTF-8 in any GET URI. Now the pieces of the puzzle are starting to fit. Browsers are inconsistent, and thus servers are inconsistent. However, that shouldn't be permission to do it incorrectly on the server side either. All it takes is one side to start doing it correctly, then everyone would complain that the other wasn't living up to the spec...problem solved for URIs.

    Gotta run to check out the latest spec on POST data now...thanks for the tip to the latest RFC specs on these matters!

    Posted by: joconner on July 28, 2005 at 09:58 AM

  • Although the URI encoding charset is defined by RFC now, it's still unusable as long as some of the major browsers use iso-8859-1 or cp1250 or whatever....

    Always use <form method="post" enctype="multipart/form-data" action="..."> tag. In most cases it's safe even without the enctype attribute, if you don't use <input type="file" ...>

    Posted by: podlesh on July 29, 2005 at 07:31 AM

  • SUN's Application Server (available via project Glassfish) uses a hidden form field to communicate the encoding of any form params to the server.

    The name of the hidden form field may be specified in a SUN specific deployment descriptor file named sun-web.xml, using the form-hint-field attribute of the <parameter-encoding> element, as in the following example:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Sun ONE Application Server 8.0 Servlet 2.4//EN" "http://www.sun.com/software/sunone/appserver/dtds/sun-web-app_2_4-0.dtd">
    <sun-web-app>
    <locale-charset-info default-locale="">
    <locale-charset-map locale="" charset=""/>
    <parameter-encoding form-hint-field="<charset>"/>
    </locale-charset-info>
    </sun-web-app>


    If specified, the server will search for a form parameter named after the value of the form-hint-field attribute, and pass its value to request.setCharacterEncoding(), before reading any form params.

    This mechanism works for both GET and POST operations.

    Posted by: jluehe on December 05, 2005 at 12:21 PM

  • Hi,

    I had some problems decoding UTF-8 with all the solutions on this page.


    At last I developed this code, and it runs well. I work with Portlets:


    public static String decodeURL (String url, String decode)
    {
        // Note: sun.io.ByteToCharConverter is an internal sun.* class, not a public Java API.
        sun.io.ByteToCharConverter fromUnicode;
        String convertedStr = url;
        try {
            fromUnicode = sun.io.ByteToCharConverter.getConverter(decode);
            fromUnicode.setSubstitutionMode(true);

            // Re-read the string's bytes (platform default charset) using the "decode" charset.
            char[] convertedChars = fromUnicode.convertAll(convertedStr.getBytes());

            convertedStr = new String(convertedChars);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (MalformedInputException e) {
            e.printStackTrace();
        }

        return convertedStr;
    }


    Daniel Prado Rodríguez

    Posted by: danonneus on March 30, 2006 at 03:21 AM

  • Let me guess...you're passing in "ISO-8859-1" as the "decode" parameter? It appears that your app server is decoding URLs to 8859-1 by default without any respect for the POST or GET headers. Your solution is effective and well-known for this particular problem, but only addresses the symptoms. The problem persists.
    What app server are you using? Version?
    You might also get some help from the expanded article form of this blog. It is here:

    Character Conversions from Browser to Database


    Posted by: joconner on March 30, 2006 at 08:46 AM

  • I think everybody has trouble with this. Here is what I did:
    I defined a properties file where you can configure the charset/encoding, and I use it all over the place (in a header, in every JSP include):


    response.setCharacterEncoding and request.setCharacterEncoding
    <meta http-equiv="Content-Type" content="...">
    in Tomcat's server.xml. But I used useBodyEncodingForURI="true" instead of URIEncoding, so it's variable


    BTW: after switching from ISO-8859-1 to UTF-8 my JavaScript didn't work anymore (in IE only), because there were comments with German special characters (so-called umlauts) in it. So you have to include a charset attribute and it works again

    e.g.: <script src="yourfile.js" type="text/javascript" charset="ISO-8859-1"></script>

    Posted by: thesuntoucher on December 15, 2006 at 12:27 AM




