
Charset Pitfalls in JSP/Servlet Containers

Posted by joconner on July 27, 2005 at 1:13 PM PDT

The J2SE platform has come a long way in internationalization. Some things are just easy...like entering your name in a Swing text field regardless of whether your name is John, José, or 田中 (Tanaka). Unicode prevails within the Java core. Unfortunately, entering non-ASCII text in the J2EE world isn't nearly as easy.

I've been playing around with various web servers recently, paying special attention to how browsers communicate non-ASCII text via GET and POST HTTP commands. While it appears that current browsers take hints from the page or form encoding and send form data back to the server in the same encoding, web servers remain blissfully unaware. They typically assume that the request encoding is ISO-8859-1. So, if my application url-encodes a GET parameter in UTF-8 (a Unicode encoding), the backend server (let's say Tomcat 5.5.9) assumes 8859-1. The result, of course, is that text data becomes mangled almost immediately as it travels through the various tiers of even a simple web-based application.

Here's a simple example JSP page, sayhello.jsp, that says "Hello, <name>!":

<%@page pageEncoding="iso-8859-1" contentType="text/html; charset=UTF-8" %>

<html>
    <head>
        <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
        <title>Say Hello!</title>
    </head>
    <body>

        <%
        String name = request.getParameter("NAME");
        if (name == null || name.length() == 0) {
            name = "World";
        }
        %>
        Hello, <%= name %>! <br>
   
        <form action="sayhello.jsp" method='GET'>
            <label for='NAME'>Name</label><input type="text" id="NAME" name="NAME"/>
            <button type="submit">Submit</button>
        </form>
   
    </body>
</html>

Type in "John" and press the submit button. The result is that a URL like this is created:

http://localhost/sayhello.jsp?NAME=John

No problem there. The call to request.getParameter("NAME") retrieves the simple ASCII text without a hitch. Subsequently, the name is output to the HTML stream back to the browser where the expected greeting appears:
Hello, John!

Now type in "José" and submit. The GET URL will look like this:

http://localhost/sayhello.jsp?NAME=Jos%C3%A9

The %C3%A9 is the URL-encoded UTF-8 representation of the é in José. Again, no problem here. The browser is taking its cue from the contentType setting, where UTF-8 is specified as the charset. However, the output to the browser is this:
Hello, José

That's not right! What happened?

The web server (again Tomcat 5.5.9 in this case) assumes 8859-1 for the URL encoding as it reads the NAME parameter Jos%C3%A9. From its perspective, the %C3 escape represents the single byte 0xC3, which is the character Ã in ISO-8859-1, and %A9 is the byte 0xA9, which is ©. So the two bytes that UTF-8 uses for one é come out as two separate characters. Hmmph...but I went through all the trouble to explicitly set the content type both in the JSP tag and in the HTML META tag.
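To make the mismatch concrete, here is a minimal sketch that decodes the same percent-encoded value under both charsets using the JDK's java.net.URLDecoder. The class and method names (QueryDecodingDemo, decode) are made up for illustration, and this only models the decoding step; it is not how Tomcat implements it internally.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryDecodingDemo {

    // Decodes a percent-encoded value, interpreting the resulting bytes
    // in the given charset. Hypothetical helper, for illustration only.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The value from the GET URL in the post: "Jos" followed by bytes C3 A9.
        String raw = "Jos%C3%A9";

        // A server assuming ISO-8859-1 maps each byte to one character:
        // C3 -> U+00C3 and A9 -> U+00A9, giving five characters.
        String asLatin1 = decode(raw, "ISO-8859-1");

        // Read as UTF-8, bytes C3 A9 together form the single character U+00E9,
        // giving the four characters the user actually typed.
        String asUtf8 = decode(raw, "UTF-8");

        System.out.println(asLatin1.length()); // 5
        System.out.println(asUtf8.length());   // 4
        System.out.println(asUtf8.equals("Jos\u00e9")); // true
    }
}
```

The bytes on the wire are identical in both cases; only the charset chosen on the receiving side differs.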

The trouble is that browsers send none of this charset information back to the web server with a GET request, and they typically leave it out of a form POST as well. The server has no way of knowing how to interpret the url-encoded parameters, so it assumes ISO-8859-1.

OK, so here's a small oversight in the HTTP or HTML spec...I haven't thought about it enough yet to decide. Regardless, this really affects multilingual communication via HTTP. JSP/Servlet containers and web servers are effectively broken in this area because of it. How should it be resolved?

Fortunately, some servers do try to address this. After searching online help sources for several hours, I found out that I can specify URIEncoding="UTF-8" in Tomcat's connector settings within the server.xml file. Now my Tomcat server reads the URL GET parameters correctly...sending out "Hello, José!" or "Hello, 田中!" as expected. However, there's still a problem.

What if I want to POST some non-ASCII data, presumably to enter into a backend database? All is well since I set that URIEncoding flag, right? Wrong. It turns out that Tomcat (sorry to pick on this particular server) doesn't use this URIEncoding flag for POSTed form data. So, what does it use? ISO-8859-1, of course! So now I'm back to where I started, and my imaginary application still greets Mr. ç”°ä¸ instead of Mr. 田中. Not good.

Now how do I get around this? Maybe I can set a hidden FORM parameter to the correct charset, read this, reset the request's character encoding via request.setCharacterEncoding(), and be done with it. I searched the online world again...sorry, nothing in the Tomcat docs on this, although I did see several requests that a parameter similar to URIEncoding be created to handle POSTed data. That would be nice. I got around my particular problem by explicitly calling request.setCharacterEncoding("UTF-8") in a control servlet, before any parameters are read. I passed in the encoding preference via a servlet initialization parameter, POST_ENCODING. That's OK, I suppose.
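A real control servlet needs the servlet API on the classpath, so as a stand-alone sketch here is a simplified model of what the container does with a POSTed application/x-www-form-urlencoded body once it has been told which charset to use. The FormBodyDemo class and its parse method are hypothetical names, and the charset argument plays the role of the POST_ENCODING value handed to request.setCharacterEncoding(); none of this is Tomcat's actual code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodyDemo {

    // Splits a form-urlencoded body into name/value pairs and decodes each
    // piece with the given charset. Simplified: the last value wins when a
    // name repeats, and no input validation is attempted.
    static Map<String, String> parse(String body, String charset) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        if (body.isEmpty()) {
            return params;
        }
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            try {
                params.put(URLDecoder.decode(key, charset),
                           URLDecoder.decode(value, charset));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unknown charset: " + charset, e);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // POST body a browser would send for U+7530 U+4E2D from a UTF-8 page.
        String body = "NAME=%E7%94%B0%E4%B8%AD";

        // Default assumption: six bytes become six unrelated characters.
        String mangled = parse(body, "ISO-8859-1").get("NAME");

        // With the encoding set explicitly, the two original characters survive.
        String intact = parse(body, "UTF-8").get("NAME");

        System.out.println(mangled.length()); // 6
        System.out.println(intact.length());  // 2
        System.out.println(intact.equals("\u7530\u4e2d")); // true
    }
}
```

In the servlet itself, the ordering matters: setCharacterEncoding() only takes effect if it is called before the first getParameter() or getReader() call on the request.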

I think it would be easier, though, if there were a more visible standard on this for all JSP/Servlet containers, HTTP servers, or application servers. In the JSP/Servlet container area, Tomcat's URIEncoding goes a long way, at least for GET requests. Unfortunately, this isn't a J2EE standard setting in a web.xml file or anything, or at least it's not obvious to me so far. To make matters worse, each server platform (Tomcat, WebLogic, others) tries to handle this in its own way, creating proprietary solutions all around. I noticed that WebLogic uses entries in its weblogic.xml deployment descriptor to handle the same problem. A standard solution for all containers would be best, I think.

The blog server here at java.net seems to handle UTF-8 just fine. That is, the server knows to expect POST data encoded as UTF-8. Does the weblogs.java.net server simply call some method setting the request handler to use UTF-8? Does it read this preference from a properties file? a descriptor file? a command line argument when starting the server? Hmm...anyone at java.net willing to share how you handle this problem?
