
Encoding URLs for non-ASCII query params

Posted by joconner on June 30, 2011 at 12:31 AM PDT

NOTE: I've updated this blog to avoid "mojibake" (garbled characters). For some reason, the character for TA in TANAKA in the name query key in the examples was garbled. Originally all occurrences of the character 中 were prefixed with the character for TA, forming the common family name TANAKA. After removing the mojibake, I think you can still understand the purpose of the blog. But it does make me ask: just what is wrong with displaying TA under this system?

Are you a web service API developer? A good one? Even great? Wherever you are on the greatness spectrum, I have a tip today that is going to make you better. But first, I have something to say. Yes, it is obvious really, but it's worth repeating. The web truly is a world-wide web. Unfortunately, a great number of globally unaware developers are on the global web. This creates an odd situation in which web services are globally accessible but only locally or regionally aware.

There are a few important things to remember when creating a global web service. Let's just cover ONE today: non-ASCII query parameters are valid, useful, and often necessary for a decent, global web service.

It seems so obvious to me, and it probably does to you. Sometimes a service needs to exchange or process non-ASCII data. The world is a big place, and although English is an important part of the global web, more people speak a different language. English accounts for a large share, but many people use Chinese or an Indic language too. Let's make sure your web service can process all those non-ASCII characters in English or any other language!

Let's look at some examples of non-ASCII query params:

In these examples, you must perform two steps to get the query params (both keys and values) into the correct form:

  1. Convert the keys and their values to UTF-8 if they are not already.
  2. Perform the "percent encoding" on each UTF-8 code unit.

To do #1, you'll need to use whatever character conversion utility you have: iconv, the Java charset encoding converters, whatever.

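Step #2 is the important one for this blog. You must "percent encode" each code unit (byte) of the UTF-8 query portion: write it as a percent sign followed by the byte's two-digit hexadecimal value. Bytes for unreserved ASCII characters, such as the letters in name, can stay as they are. Let's look at the first example query params: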

name=中&city=東京

The JavaScript function encodeURI does this for us. It leaves the = and & delimiters untouched, so it works on a whole query string:

encodeURI("name=中&city=東京") produces the string:

name=%E4%B8%AD&city=%E6%9D%B1%E4%BA%AC
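To see the two steps spelled out, here is a minimal sketch that does them by hand. It assumes a runtime with the standard TextEncoder (modern browsers and Node.js). percentEncodeUtf8 is a hypothetical helper name, not a built-in, and it encodes every byte for illustration, whereas encodeURI leaves ASCII letters alone.

```javascript
// Hypothetical helper: percent-encode every byte of a string's UTF-8 form.
function percentEncodeUtf8(text) {
  // Step 1: convert the JavaScript string to UTF-8 bytes (code units).
  const bytes = new TextEncoder().encode(text);
  // Step 2: write each byte as "%" followed by two uppercase hex digits.
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncodeUtf8("中"));   // %E4%B8%AD
console.log(percentEncodeUtf8("東京")); // %E6%9D%B1%E4%BA%AC
```

For non-ASCII input such as these values, the result matches what encodeURI produces.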

Notice that you should also apply this encoding to the keys in the param list. In the next example, I've used Japanese text for both keys and values.

encodeURI("名前=中&市=東京") produces this string:

%E5%90%8D%E5%89%8D=%E4%B8%AD&%E5%B8%82=%E6%9D%B1%E4%BA%AC

Note that both the keys and values have been "percent encoded".
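If a key or value may itself contain & or =, encodeURI will leave those characters alone and the query will be ambiguous. One possible approach, sketched here, is to encode each key and value separately with the standard encodeURIComponent and then join the pieces. buildQuery is a hypothetical helper name used only for this illustration.

```javascript
// Hypothetical helper: build a query string from [key, value] pairs,
// percent-encoding each key and each value on its own.
function buildQuery(pairs) {
  return pairs
    .map(([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(value))
    .join("&");
}

console.log(buildQuery([["名前", "中"], ["市", "東京"]]));
// %E5%90%8D%E5%89%8D=%E4%B8%AD&%E5%B8%82=%E6%9D%B1%E4%BA%AC

// A value containing the delimiters themselves stays unambiguous.
console.log(buildQuery([["q", "R&D=fun"]]));
// q=R%26D%3Dfun
```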

On the server side, your server will understand how to decode these values into their correct UTF-8 string values if you have configured it correctly. Correct configuration of a server usually involves a charset conversion filter for a servlet container and sometimes just a config setting for Apache.
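The decoding step itself can be sketched in JavaScript, leaving aside the servlet and Apache configuration. This assumes the server hands you the raw query string and that the runtime provides the standard URLSearchParams (Node.js and modern browsers do). decodeQuery is a hypothetical helper name.

```javascript
// Hypothetical helper: decode a percent-encoded query string into a plain
// object. URLSearchParams interprets the percent-encoded bytes as UTF-8.
function decodeQuery(query) {
  return Object.fromEntries(new URLSearchParams(query));
}

const decoded = decodeQuery("name=%E4%B8%AD&city=%E6%9D%B1%E4%BA%AC");
console.log(decoded.name); // 中
console.log(decoded.city); // 東京

// Keys are decoded the same way as values.
console.log(Object.keys(decodeQuery("%E5%90%8D%E5%89%8D=%E4%B8%AD"))[0]); // 名前
```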

More on this at a later time.


Comments


I have tried posting this particular blog 2x. However, *something* is changing the content. All the non-ASCII content is being mangled. This is typical in a system in which there is an inconsistency among the charsets used at different points in the technology stack: Drupal, database, app server. In this case I suspect that your database is not set up to properly store UTF-8 content, and is mangling it. See the malformed characters in the above blog.


So far I couldn't find anything in the specification saying that GET parameters can't be in UTF-8 (I might be wrong).
I have seen websites that use Greek characters within GET parameters, and these get indexed and linked properly in search engines.
What RFC states it's not possible?


I have one web application, and one of the parameters is sector. The typical URL goes like this: http://www.example.com/MyApp/servlet?param1=abc&sector=2...

Well, I receive a lot of requests with the parameter §or=2. Why? Because some browsers and search engines interpret "&sect" as an HTML character entity, although the standard says it must be "&sect;".

Next time I choose the name of a parameter, I will check the HTML specifications :-)