What factors make PHP Unicode-incompatible?


When PHP was created, several years ago, UTF-8 was not really supported. We are talking about a time when non-Unicode operating systems like Windows 98/Me were still current and when other big languages like Delphi were also non-Unicode. Not every language was designed with Unicode in mind from day one, and completely converting a language to Unicode without breaking a lot of existing code is hard. Delphi only became Unicode-compatible a year or two ago, for example, while other languages like Java and C# were designed around Unicode from day one.

So as PHP grew into PHP 3, PHP 4 and now PHP 5, simply no one decided to add Unicode. Why? Presumably to stay compatible with existing scripts, or because utf8_encode()/utf8_decode() and the mbstring extension already existed and worked. I do not know for sure, but I strongly believe that it has something to do with organic growth: features do not simply exist by default, they have to be written by someone, and that simply has not happened for PHP yet.

Edit: OK, I read the question wrong. The question is: how are strings stored internally? If I type in "Währung" or "Écriture", which encoding is used to create the bytes? In PHP's case, it is ASCII with a code page. That means: if I encode the string using ISO-8859-15 and you decode it with some Chinese code page, you will get weird results.

The alternative is what languages like C# or Java do, where everything is stored as Unicode: there is no code page anymore, and in theory you cannot mess up. I recommend Joel's article about Unicode and character sets, but essentially it boils down to: how are strings stored internally? With PHP the answer is "not in Unicode", which means you have to be very careful and explicit when processing strings, making sure the string stays in the proper encoding through input, storage (database) and output. That is very error-prone.
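A small sketch of that error-proneness (assuming a PHP source file saved as UTF-8 and the mbstring extension available):

```php
<?php
// PHP strings are byte arrays; their meaning depends on the encoding
// you *assume*, which is exactly the error-prone part described above.
$utf8 = "Währung";                // this source file is saved as UTF-8

// The byte-oriented strlen() counts 8 bytes ("ä" is 2 bytes in UTF-8)...
var_dump(strlen($utf8));               // int(8)
// ...while mb_strlen() counts 7 characters once told the encoding.
var_dump(mb_strlen($utf8, 'UTF-8'));   // int(7)

// Decode the same bytes with the wrong "code page" and you get mojibake:
echo mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1'); // "WÃ¤hrung"
```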


i believe it is largely a cultural difficulty, not a technical one.

as for the technical problems (and it is by no means trivial to implement unicode in an ecosystem built on the assumption that 'one character equals one byte'), the developers could have copied much of java's or python's efforts (the latter with decent and largely working unicode support since around 2001), but they never did.

when i read the discussion thread attached to the official, current documentation for php's utf8_encode() function, i get a feeling of vertigo.

first off, that function is called utf8_encode(); however, the documentation states that the string it expects must be in ISO-8859-1 (a.k.a. latin-1). that's sooo php, that's sooo 80s.
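a sketch of what that means in practice (note that utf8_encode() has since been deprecated in php 8.2 in favour of explicit conversions such as mb_convert_encoding()):

```php
<?php
// utf8_encode() blindly assumes its input is ISO-8859-1: the single
// latin-1 byte 0xE4 ("ä") becomes the two-byte UTF-8 sequence 0xC3 0xA4.
$latin1 = "\xE4";                                 // "ä" in ISO-8859-1
var_dump(utf8_encode($latin1) === "\xC3\xA4");    // bool(true)

// the modern equivalent makes the source encoding explicit:
var_dump(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === "\xC3\xA4"); // bool(true)
```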

most commenters seem to perceive unicode as a burden. there are many proposals for how to convert strings 'of unknown content', how to deal with 'strings with mixed encodings' (wtf?), or how to deal with codepoints that normally cause breakage because they are beyond that function's four-bytes-per-codepoint limit.

the discussion is centered around fixups to get rid of squiggles or to avoid the problematic parts of that function's behavior. and that, to me, is sooo php: everyone's just doing fixes; few things are implemented in a fundamentally correct way. if you believe this to be slander on my part, here are some tidbits:

Although this seems to break german Umlaute [äöü] if the document is already UTF-8.

(failure to understand that utf-8 is not designed to work when applied twice)
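what actually happens when you apply the conversion twice (a sketch using mb_convert_encoding(), which does the same latin-1-to-utf-8 mapping):

```php
<?php
// applying a latin-1-to-utf-8 conversion to a string that is *already*
// utf-8 encodes the bytes a second time: "ä" (0xC3 0xA4) comes out as
// "Ã¤" (0xC3 0x83 0xC2 0xA4): exactly the 'broken Umlaute' quoted above.
$alreadyUtf8 = "\xC3\xA4";  // "ä" in UTF-8
$double = mb_convert_encoding($alreadyUtf8, 'UTF-8', 'ISO-8859-1');
var_dump($double === "\xC3\x83\xC2\xA4"); // bool(true)
```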

Look at iconv() function, which offers a way to convert from 8859 and dreaded 1252 into UTF8

(a good point: neglect of prior art on the part of the php developers; instead, a buggy home-grown implementation)
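iconv() has indeed handled this correctly for a long time; a sketch (encoding names as understood by a glibc-based iconv):

```php
<?php
// 0x80 is "€" in windows-1252 but undefined in ISO-8859-1; iconv()
// converts it correctly when told the real source encoding.
var_dump(iconv('Windows-1252', 'UTF-8', "\x80") === "\xE2\x82\xAC"); // bool(true)
```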

use of preg_match to detect if utf8_encode is needed [...] excluding surrogates [...] excluding overlongs

(the suggestion is to silently erase all problematic content from strings, leaving only those things that do not break utf8_encode(); this may make texts unreadable (or vanish altogether), but hey, no more error messages)

to encode a string only if it is not yet UTF-8 [...] mb_detect_encoding($s, "UTF-8")

(as pointed out by another commenter, this is not going to work:

    $str = 'áéóú'; // ISO-8859-1
    mb_detect_encoding($str, 'UTF-8');       // 'UTF-8'
    mb_detect_encoding($str, 'UTF-8', true); // false

so here we're looking at one bug being replaced by another one. happy hunting. also, what they seem to propose here is to solve by heuristic (slow, uncertain) means a problem that could and should be solved by mechanical (fast, certain) means)
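such a mechanical check does exist, by the way: mb_check_encoding() validates the byte sequence itself instead of guessing (a sketch):

```php
<?php
// mb_check_encoding() answers 'are these bytes valid utf-8?' directly,
// with no heuristics involved:
var_dump(mb_check_encoding("\xE1\xE9\xF3\xFA", 'UTF-8')); // bool(false)  ("áéóú" in ISO-8859-1)
var_dump(mb_check_encoding("\xC3\xA1", 'UTF-8'));         // bool(true)   ("á" in UTF-8)
```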

utf8_[encode|decode] will actually translate windows-1252 characters as well, not just from/to ISO-8859-1 as the documentation says

(you cannot ever rely on the official php documentation to be clear or exhaustive; you must always read through years of user comments which no one will ever feed back into the docs)

I've been working on a is_utf8 function and wanted to post it here, in addition to others i also took in consideration the 5000 char bug

(a fix for a problem that largely only exists because unicode is not properly implemented. we also learn that not only will the utf8_encode() function give up beyond 4 bytes per codepoint, it will also break if the resulting (or output?) text exceeds a limit of 5000 characters)

i could go on and on like this. you already get the idea: judging from this thread, the php community simply does not sound anywhere near ready to grasp what encodings and character sets are all about, what it takes to build a sound infrastructure in general or, specifically, to implement unicode in a proper way. instead, they take their scaffolds, their cardboard, their nails and hammers and go on building this grand edifice called php, throwing duct tape at every problem that can't be fixed with another nail. of course, that building will suffer from every wind that comes blowing, such as the occasional legal but unexpected character.

seeing this particular thread stay active for eight years does not exactly instill confidence that the situation will be any better eight years from now.


The concept of a "multibyte character" is at the core of the problem.

  1. It leaks an implementation detail: you should be able to work with the abstraction of a character without knowing how the implementers choose to represent the data. Perhaps, depending on the platform, it suits them to represent everything as UTF-16 or UTF-32, in which case everything is multibyte; the users of the character abstraction should not have to care.
  2. It's a kludge: On top of an out-of-date habit of thought where we all "really know" that strings are byte sequences, we now have to know that sometimes the bytes clump together into things known as Unicode characters, and have special cases all over the place to deal with it.
  3. It's like a mouse trying to eat an elephant. By framing Unicode as an extension of ASCII (we have normal strings and we have mb_strings), it gets things the wrong way around, and gets hung up on which special cases are required to deal with characters with funny squiggles that need more than one byte. If you treat Unicode as providing an abstract space for any character you need, ASCII is accommodated within it without any need to treat it as a special case.
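A sketch of that last point in PHP terms (assuming a UTF-8 source file and the mbstring extension): with a character-level API, ASCII needs no special case, and the byte-level view stays an implementation detail.

```php
<?php
// The same character-level call works for plain ASCII and for
// "funny squiggles" alike: ASCII is just a subset of Unicode.
var_dump(mb_strlen('abc', 'UTF-8'));    // int(3)
var_dump(mb_strlen('日本語', 'UTF-8')); // int(3)

// The byte-level view differs, but callers should not have to care:
var_dump(strlen('日本語'));             // int(9)  (3 bytes per character in UTF-8)
```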