Is it safe to assume decoded percent-encoded URIs turn into UTF-8?


Thank you for all the comments and answers! I have done some digging myself after I posted the question and would like to write it down here as a reference. Please let me know if this answer is wrong.

Skip to the end to go directly to the conclusion.

In the Jetty docs on International Characters and Character Encoding, in the section "International characters in URLs", I found these paragraphs:

Due to the lack of a standard, different browers took different approaches to the character encoding used. Some use the encoding of the page and some use UTF-8. Some drafts were prepared by various standards bodies suggesting that UTF-8 would become the standard encoding. Older versions of jetty (eg 4.0.x series) used UTF-8 as the default in anticipation of a standard being adopted. As a standard was not forthcoming, jetty-4.1.x reverted to a default encoding of ISO-8859-1.

The W3C organization's HTML standard now recommends the use of UTF-8: http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars and accordingly jetty-6 series uses a default of UTF-8.

The linked HTML 4.0 spec does indeed recommend that clients encode non-ASCII characters as UTF-8 before percent-encoding them, so this has been a W3C recommendation since HTML 4.0.

The example used on the page is this:

<A href="http://foo.org/Håkon">...</A>

While it later states that the same encoding should be applied to the fragment part, it doesn't say whether it also applies to the query string.
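To make the recommendation concrete, here is a minimal sketch of "UTF-8 first, then percent-encode" applied to the name from the spec's example. It assumes the mbstring extension is available; the variable names are mine, and the `\u{E5}` escape is used only so the snippet does not depend on how the file itself is saved.

```php
<?php
// "Håkon" as UTF-8 bytes; \u{E5} is U+00E5 (å).
$segment = "H\u{E5}kon";

// rawurlencode() percent-encodes every byte that is not an
// RFC 3986 unreserved character, so the two UTF-8 bytes of å
// become two escapes.
$encoded = rawurlencode($segment);
echo "http://foo.org/" . $encoded, "\n"; // http://foo.org/H%C3%A5kon

// For comparison: the same name as ISO-8859-1 bytes produces a
// different, single-byte escape.
$latin1 = mb_convert_encoding($segment, 'ISO-8859-1', 'UTF-8');
echo rawurlencode($latin1), "\n"; // H%E5kon
```

The server only ever sees the escapes, which is why it has to know (or guess) which of the two encodings the client used.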

Typing URLs into browsers

Firefox

As Pekka already mentioned, based on this link, Firefox was still sending ISO-8859-1 encoded URIs as late as 2007. Reading the link, this seems to be the default behavior for Firefox < 3.0. I'm not sure if this also applies to Firefox < 3.0 on Mac OS X, since the default encoding on Mac is UTF-8.

I've tested Firefox 3.6.13 on Windows XP and Firefox 6 on both Windows 7 and Mac OS X. The Mac version sends everything in UTF-8, so it's nothing to worry about.

Firefox 3.6.13 and 6 on Windows encode query strings as ISO-8859-1 by default, but when you type characters that don't exist in ISO-8859-1 into the query string (α, for example), Firefox 3 switches the encoding of the entire query string to UTF-8. I'm pretty sure later versions behave the same way.

In the Firefox 3.6.13 and 6 builds on Windows that I tested, the path part of the URI is always encoded as UTF-8.

If you type this URL into Firefox 3.6/6 on Windows:

http://localhost/test/ü/ä/index.php?chär=ü

The query string gets encoded as ISO-8859-1, but the 'path' part gets encoded as UTF-8:

http://localhost/test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC
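On the server side, this mixed result can be observed roughly as follows. This is only a sketch, assuming mbstring is available; it takes the encoded URI above as input (written with a single slash after the host) and decodes the whole query string in one go, which is fine for this example but would be wrong for queries containing escaped `&` or `=`.

```php
<?php
$uri = 'http://localhost/test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC';

// Split the URI, then undo the percent-encoding of each part.
$path  = rawurldecode(parse_url($uri, PHP_URL_PATH));
$query = rawurldecode(parse_url($uri, PHP_URL_QUERY));

// The path bytes form valid UTF-8, the query bytes do not.
var_dump(mb_check_encoding($path, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($query, 'UTF-8')); // bool(false)

// Interpreting the query bytes as ISO-8859-1 recovers the text.
$queryUtf8 = mb_convert_encoding($query, 'UTF-8', 'ISO-8859-1');
echo $queryUtf8, "\n"; // chär=ü
```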

Also worth noting: according to this blog post, Firefox 3.0 converts the katakana character ア into &#12450; before percent-encoding it. When I tried this in Firefox 3.6.13, in both the query string and the path, the katakana character was correctly encoded as UTF-8.

Opera

Opera 10.10 on Mac encodes the query string part of the URI as ISO-8859-1, even though the default encoding for Mac OS X is UTF-8. The 'path' part gets encoded as UTF-8, just like Firefox.

If you type the Greek letter α into the query string, it gets sent as a question mark.

The same behavior is exhibited by Opera 11.51 in Windows XP.

Safari

Safari 5.1 on Mac always sends everything as UTF-8. Safari 5.1 on Windows exhibits the same behavior.

Chrome

Version 13 on Windows encodes both query string and path as UTF-8. I don't have Chrome on Mac, but it seems safe to assume that Chrome always sends UTF-8, like Safari.

Internet Explorer

DISCLAIMER: I use IECollection to install multiple versions of IE on one machine, so this may not be IE's natural behavior (can anyone confirm this?).

IE 6, 7, and 8 on Windows XP encode the 'path' part of the URI as UTF-8 correctly. Umlauts and Greek letters typed into the query string do not get percent-encoded, though. The query string typed into the address bar seems to be sent as ISO-8859-1, and the Greek letter alpha 'α' in the query string gets transliterated into 'a'.

Conclusion

This is short and incomplete, and I cannot guarantee its correctness, but it seems that the most common encodings for URIs are ISO-8859-1 and UTF-8 (I have no idea what East Asian users use as their encoding, and it would be too exhausting for me to try to find out).

Since it has been a recommendation since HTML 4.0, I guess it's safe to assume the 'path' part of the URI is always encoded in UTF-8. Firefox 2.0 might still be around, so you must check whether the encoding is ISO-8859-1 too. If it's not UTF-8 or ISO-8859-1, most likely it's a bad request.

It's theoretically impossible to correctly detect the encoding of a string (see here, and here). You can guess, but you can get the wrong result. So don't rely on encoding detection.
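A small sketch of why guessing can go wrong. `guessToUtf8()` is a hypothetical helper, not a library function: it treats the bytes as UTF-8 if they validate and otherwise assumes ISO-8859-1. It assumes mbstring is available.

```php
<?php
/**
 * Hypothetical helper: keep the bytes if they are valid UTF-8,
 * otherwise assume they are ISO-8859-1 and convert them.
 */
function guessToUtf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// A lone 0xFC is not valid UTF-8, so the fallback applies: "ü".
$a = guessToUtf8("\xFC");

// 0xC3 0xBC is "Ã¼" in ISO-8859-1, but it is also the valid UTF-8
// sequence for "ü". The helper cannot know which one the sender
// meant and always picks the UTF-8 reading.
$b = guessToUtf8("\xC3\xBC");

var_dump($a === $b); // bool(true): two different inputs, same guess
```

Every byte sequence is also a valid ISO-8859-1 string, so a fallback like this never rejects anything; it can only say "valid UTF-8" or "assumed Latin-1".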

Safe Multibyte Routing

The safest way is just to choose one encoding (UTF-8 is the safest bet) for your entire application. Then you have to:

  1. Make sure that all your strings are encoded in UTF-8 before using them to build your URI. Properly percent-encode your URI after that.
  2. Make sure all your URL-encoded (GET) forms send their data in the proper encoding. See this FAQ by Kore Nordmann for more information about making sure your forms send the correct encoding.
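The first step above could be sketched like this. The segment and parameter values are made up for illustration and are assumed to be UTF-8 already; the guard loop simply refuses anything that is not. mbstring is assumed to be available.

```php
<?php
// Hypothetical values, written with \u{...} escapes so they are
// UTF-8 regardless of how this file is saved (ü = U+00FC, ä = U+00E4).
$segments = ['test', "\u{FC}", "\u{E4}", 'index.php'];
$params   = ["ch\u{E4}r" => "\u{FC}"];

// Guard: never build a URI from strings that are not valid UTF-8.
$all = array_merge($segments, array_keys($params), array_values($params));
foreach ($all as $s) {
    if (!mb_check_encoding($s, 'UTF-8')) {
        throw new InvalidArgumentException('Not valid UTF-8');
    }
}

// Percent-encode each path segment separately so "/" stays a separator.
$path = '/' . implode('/', array_map('rawurlencode', $segments));

// PHP_QUERY_RFC3986 makes http_build_query() use rawurlencode() rules.
$query = http_build_query($params, '', '&', PHP_QUERY_RFC3986);

echo $path . '?' . $query, "\n";
// /test/%C3%BC/%C3%A4/index.php?ch%C3%A4r=%C3%BC
```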

Also see this great answer from bobince.

After this, you shouldn't have any problems parsing the URI. If the encoding is not UTF-8, then it's a bad request, and you can respond with a 404 or 400 page.
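A minimal sketch of that strict policy, assuming mbstring is available. `statusForRequestUri()` is a hypothetical helper name; it only decides between "carry on" and "bad request" and leaves the actual routing to the application.

```php
<?php
/**
 * Hypothetical helper: returns the HTTP status to use for a raw
 * (still percent-encoded) request URI. 200 means "continue with
 * routing", 400 means the decoded bytes are not valid UTF-8.
 */
function statusForRequestUri(string $rawUri): int
{
    $decoded = rawurldecode($rawUri);
    return mb_check_encoding($decoded, 'UTF-8') ? 200 : 400;
}

// In a front controller this might be wired up roughly like so:
// http_response_code(statusForRequestUri($_SERVER['REQUEST_URI']));

var_dump(statusForRequestUri('/test/%C3%BC/index.php?q=%C3%BC')); // int(200)
var_dump(statusForRequestUri('/test/%C3%BC/index.php?q=%FC'));    // int(400)
```

The check runs on every request, so it also covers hand-crafted URIs that never came from your own links or forms.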


Since it is unsafe to assume that anyway ("bad guys don't care"), you can use mb_check_encoding to test whether a string is valid UTF-8. UTF-8 has a structure that a string in another encoding is unlikely to conform to.


You don't know. It depends on the person/code that generated the URI.