
Detect encoding and make everything UTF-8


If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
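A minimal sketch of that double-encoding effect. It uses mb_convert_encoding() with ISO-8859-1 as the source, which performs the same Latin-1 to UTF-8 conversion as utf8_encode() but avoids the deprecation notice that utf8_encode() raises since PHP 8.2. The variable names are illustrative only.

```php
<?php
// "é" encoded as UTF-8 is the two bytes C3 A9.
$utf8 = "\xC3\xA9";

// Treat those UTF-8 bytes as if they were Latin-1 and encode them again.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($double), "\n"; // c383c2a9, which displays as "Ã©"
```

Each of the two original bytes was read as a separate Latin-1 character and re-encoded, which is why one character turns into two.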

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
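To illustrate how a mixed string can be handled at all, here is a rough sketch. It is not the library's implementation; mixedToUtf8() is a hypothetical helper that keeps every well-formed UTF-8 sequence as-is and assumes any remaining high byte is Windows-1252.

```php
<?php
// Hypothetical helper, not part of ForceUTF8: keep well-formed UTF-8
// sequences and treat every other non-ASCII byte as Windows-1252.
function mixedToUtf8(string $input): string
{
    $validSequence = implode('|', [
        '[\xC2-\xDF][\x80-\xBF]',            // 2 bytes
        '\xE0[\xA0-\xBF][\x80-\xBF]',        // 3 bytes, no overlong forms
        '[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}', // 3 bytes
        '\xED[\x80-\x9F][\x80-\xBF]',        // 3 bytes, no surrogates
        '\xF0[\x90-\xBF][\x80-\xBF]{2}',     // 4 bytes, no overlong forms
        '[\xF1-\xF3][\x80-\xBF]{3}',         // 4 bytes
        '\xF4[\x80-\x8F][\x80-\xBF]{2}',     // 4 bytes, up to U+10FFFF
    ]);
    // No /u modifier: the pattern has to work on raw bytes.
    $pattern = '/(?:' . $validSequence . ')|([\x80-\xFF])/';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Group 1 is only present when the fallback matched a stray byte.
        return isset($m[1])
            ? mb_convert_encoding($m[1], 'UTF-8', 'Windows-1252')
            : $m[0];
    }, $input);

    return $result ?? $input;
}

// First "café" is UTF-8, second one is Latin-1 / Windows-1252.
echo mixedToUtf8("caf\xC3\xA9 caf\xE9"), "\n"; // café café
```

The approach is inherently a guess: Latin-1 text that happens to contain a byte pair forming a valid UTF-8 sequence (such as "Ã©") is left untouched, because it cannot be told apart from real UTF-8.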

Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

    $latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

    echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football\n");
    echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football\n");
    echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football\n");
    echo Encoding::fixUTF8("FÃ©dération Camerounaise de Football\n");

will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().


You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML declaration (the <?xml ... ?> line at the start of the document). If that’s missing too, use UTF-8 as defined in the specification.
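That order of precedence can be sketched as a small pure function, independent of how the feed is fetched. feedEncoding() and its parameters are hypothetical names chosen for illustration; $contentType is assumed to be the raw value of the Content-Type header, or null if the header is absent.

```php
<?php
// Hypothetical helper: decide which encoding to assume for a feed.
function feedEncoding(?string $contentType, string $body): string
{
    // 1. charset parameter of the Content-Type header
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*["\']?([^\s;"\']+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }

    // 2. encoding attribute of the XML declaration
    if (preg_match('/^<\?xml[^>]*?\sencoding\s*=\s*["\']([^"\']+)["\']/', $body, $m)) {
        return strtolower($m[1]);
    }

    // 3. fall back to UTF-8
    return 'utf-8';
}

echo feedEncoding('text/xml; charset=ISO-8859-1', '<rss/>'), "\n"; // iso-8859-1
echo feedEncoding(null, '<rss/>'), "\n";                           // utf-8
```

Keeping the decision separate from the HTTP code makes it easy to test against saved headers and documents.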


Edit: Here is what I would probably do:

I’d use cURL to send the request and fetch the response. That allows you to set specific request header fields and fetch the response header as well. After fetching the response, you have to parse it and split it into header and body. The header should then contain the Content-Type header field, which holds the MIME type and (hopefully) the charset parameter with the encoding too. If not, we’ll check the XML declaration for the encoding attribute and get the encoding from there. If that’s also missing, the XML specification says to assume UTF-8.

    $url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

    // Media types and charsets we are willing to accept.
    $accept = array(
        'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
        'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
    );
    $requestHeader = array(
        'Accept: '.implode(', ', $accept['type']),
        'Accept-Charset: '.implode(', ', $accept['charset']),
    );
    $encoding = null;

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);  // include the response header in the result
    curl_setopt($curl, CURLOPT_HTTPHEADER, $requestHeader);
    $response = curl_exec($curl);

    if (!$response) {
        // error fetching the response
    } else {
        // Split the raw response into header and body at the first blank line.
        $offset = strpos($response, "\r\n\r\n");
        $header = substr($response, 0, $offset);
        $body = substr($response, $offset + 4);

        // 1. charset parameter of the Content-Type header
        if (!$header || !preg_match('/^Content-Type:\s+([^;\s]+)(?:;\s*charset=([^\s;]+))?/im', $header, $match)) {
            // error parsing the response
        } else {
            if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
                // type not accepted
            }
            // The charset group is absent when the header has no charset parameter.
            if (isset($match[2])) {
                $encoding = trim($match[2], '"\'');
            }
        }

        // 2. encoding attribute of the XML declaration
        if (!$encoding) {
            if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
                $encoding = trim($match[1], '"\'');
            }
        }

        // 3. default to UTF-8
        if (!$encoding) {
            $encoding = 'utf-8';
        } else {
            $encoding = strtolower($encoding);
            if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
                // encoding not accepted
            }
            if ($encoding != 'utf-8') {
                $body = mb_convert_encoding($body, 'utf-8', $encoding);
                // Keep the XML declaration consistent with the converted bytes,
                // otherwise the parser would decode them using the old encoding.
                $body = preg_replace('/^(<\?xml[^>]*?encoding=)(?:"[^"]*"|\'[^\']*\')/', '$1"utf-8"', $body, 1);
            }
        }

        $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
        if (!$simpleXML) {
            // parse error
        } else {
            echo $simpleXML->asXML();
        }
    }


Detecting the encoding is hard.

mb_detect_encoding works by guessing, based on a list of candidates that you pass it. In some encodings certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid but mean different things. In these cases there is no way to determine the encoding from the bytes alone; you can implement your own logic to make guesses. For example, data coming from a Japanese site is more likely to use a Japanese encoding.
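A small illustration of why validity alone cannot settle the question, using mb_check_encoding() on two byte strings for "café". The variable names are made up for the example.

```php
<?php
$latin1Bytes = "caf\xE9";     // "café" in ISO-8859-1 / Windows-1252
$utf8Bytes   = "caf\xC3\xA9"; // "café" in UTF-8

// A lone E9 byte is not a complete UTF-8 sequence, so UTF-8 can be ruled out.
var_dump(mb_check_encoding($latin1Bytes, 'UTF-8'));      // bool(false)
var_dump(mb_check_encoding($latin1Bytes, 'ISO-8859-1')); // bool(true)

// The UTF-8 bytes are valid in both encodings: read as Latin-1 they spell "cafÃ©".
var_dump(mb_check_encoding($utf8Bytes, 'UTF-8'));        // bool(true)
var_dump(mb_check_encoding($utf8Bytes, 'ISO-8859-1'));   // bool(true)
```

Every byte value is acceptable in ISO-8859-1, so a validity check can exclude UTF-8 but can never exclude Latin-1.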

As long as you only deal with Western European languages, the three major encodings to consider are UTF-8, ISO-8859-1 and Windows-1252 (CP1252). Since these are the defaults on many platforms, they are also the ones most likely to be reported wrongly. People who use a different encoding tend to report it accurately, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider unless the encoding is reported as one of those three. You should still double-check that the data is indeed valid, using mb_check_encoding (note that valid is not the same as correct: the same input may be valid in many encodings). If the reported encoding is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detection order, which is UTF-8,ISO-8859-1,WINDOWS-1252.
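The strategy could look roughly like this. resolveEncoding() is a hypothetical helper, and returning null for "cannot be right" is an assumption made for the sketch. Note that from PHP 8.1 on, mb_detect_encoding() scores all candidates rather than simply returning the first valid one, so which of the two single-byte encodings it reports may vary between PHP versions.

```php
<?php
// Hypothetical helper: returns the encoding to assume for $data, or null
// when the reported encoding cannot be right.
function resolveEncoding(string $data, string $reported): ?string
{
    $reported = strtolower(trim($reported));
    $suspects = ['utf-8', 'iso-8859-1', 'windows-1252'];

    if (!in_array($reported, $suspects, true)) {
        try {
            // Trust the provider, but only if the bytes are valid in that encoding.
            return mb_check_encoding($data, $reported) ? $reported : null;
        } catch (\ValueError $e) {
            return null; // PHP 8 throws for names mbstring does not know
        }
    }

    // Strict mode rejects candidates in which the bytes are not valid.
    $detected = mb_detect_encoding($data, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
    return $detected === false ? null : $detected;
}

// Reported as UTF-8, but the lone E9 byte is invalid UTF-8,
// so one of the single-byte encodings is returned instead.
var_dump(resolveEncoding("caf\xE9", 'utf-8'));
```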

Once you've detected the encoding, you need to convert the data to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
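A short sketch of the conversion step with mb_convert_encoding(). convertToUtf8() is a hypothetical wrapper. The second pair of calls shows why the source encoding matters even between ISO-8859-1 and Windows-1252: byte 0x80 is the euro sign in Windows-1252 but a control character in ISO-8859-1.

```php
<?php
// Hypothetical wrapper: convert $data from $sourceEncoding to UTF-8.
function convertToUtf8(string $data, string $sourceEncoding): string
{
    if (strcasecmp($sourceEncoding, 'UTF-8') === 0) {
        return $data; // already in the target encoding
    }
    return mb_convert_encoding($data, 'UTF-8', $sourceEncoding);
}

echo bin2hex(convertToUtf8("caf\xE9", 'ISO-8859-1')), "\n"; // 636166c3a9
echo bin2hex(convertToUtf8("\x80", 'Windows-1252')), "\n";  // e282ac (U+20AC, euro sign)
echo bin2hex(convertToUtf8("\x80", 'ISO-8859-1')), "\n";    // c280 (U+0080, control character)
```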