PHP: UTF 8 characters encoding


Your page is being served as UTF-8 so I'd point my finger at the database.

Make sure the connection is using UTF-8 before you run any SELECTs or INSERTs. In MySQL:

SET NAMES "utf8"
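From PHP, the same thing can be done through the client library rather than a raw query. This is a minimal sketch using mysqli; the function name and connection details are placeholders of mine, not anything from the question. I've used `utf8mb4` because MySQL's `utf8` only stores characters of up to three bytes, but pass `'utf8'` if that is what your tables were created with.

```php
<?php
// Hypothetical helper: open a mysqli connection and switch it to UTF-8
// before any query runs. Host, user, password and schema are placeholders.
function openUtf8Connection(string $host, string $user, string $pass, string $db): mysqli
{
    // Throw exceptions on errors instead of failing silently.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $mysqli = new mysqli($host, $user, $pass, $db);

    // Has the same effect on the session as SET NAMES, and also tells the
    // client library which charset to assume in real_escape_string().
    $mysqli->set_charset('utf8mb4');

    return $mysqli;
}
```

Call it once, for example `openUtf8Connection('localhost', 'user', 'secret', 'feeds')`, and run every SELECT and INSERT through the connection it returns.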


Just a quick note about CURLOPT_ENCODING: it sets the Accept-Encoding request header, which governs content compression and is not the same thing at all as character encoding. Supported accept encodings are "identity", "deflate", and "gzip".
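To make the distinction concrete, here is a rough sketch of a fetch that sets CURLOPT_ENCODING for compression and reads the character encoding from the Content-Type response header instead. Both function names are made up for illustration, and the charset the server reports may itself be wrong, as this question shows.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null; // the header declared no charset
}

// Hypothetical helper: fetch a URL and report the body plus declared charset.
function fetchFeed(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // Compression only: an empty string asks for every encoding this cURL
    // build supports, and cURL decompresses the body before returning it.
    curl_setopt($ch, CURLOPT_ENCODING, '');

    $body = curl_exec($ch); // false if the transfer failed

    // The character encoding, if the server declares one, lives here.
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    return [
        'body'    => $body,
        'charset' => charsetFromContentType(is_string($contentType) ? $contentType : null),
    ];
}
```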


As with all debugging, start by isolating the problem. Taking your description one step at a time:

"I am scraping a list of RSS feeds by using cURL," - look at the XML from the RSS feed that's giving the problem. There's more than one feed, so some feeds may be right, and the feeds that are wrong may be wrong in different ways.

"and then I am reading and parsing the RSS data with SimpleXML." - print out the field that SimpleXML read. Is it OK, or does a problem show up?

"The sorted data is then inserted into a MySQL database." - in MySQL, select HEX(field), LENGTH(field), and CHAR_LENGTH(field) for the piece of data that's giving the problem, and compare the byte length with the character length.
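The same three measurements can be taken on the PHP side, before the data reaches the database, which helps to pin down which step introduces the damage. A small sketch, with a helper name of my own choosing:

```php
<?php
// Hypothetical helper mirroring MySQL's HEX(), LENGTH() and CHAR_LENGTH().
function describeField(string $field): array
{
    return [
        'hex'         => bin2hex($field),               // the raw bytes
        'length'      => strlen($field),                // length in bytes
        'char_length' => iconv_strlen($field, 'UTF-8'), // length in characters,
                                                        // false if not valid UTF-8
    ];
}

// "café" in UTF-8: hex 636166c3a9, 5 bytes, 4 characters.
print_r(describeField("caf\xC3\xA9"));
```

If the hex dump coming out of SimpleXML already differs from what you expect, the database is not where the problem starts.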

EDIT

Take the feed http://hangout.altsounds.com/external.php?type=RSS2 and put it into the validator at http://validator.w3.org/feed/ . The feed declares its encoding as ISO-8859-1, but some of the actual content, such as the quotes, is in something like cp1252. For example, it uses the byte 0x93 to represent the left double quote (U+201C): http://www.fileformat.info/info/unicode/char/201C/charset_support.htm . In ISO-8859-1 that byte is not a quote at all but a control character.

What's annoying is that this doesn't show up in some tools. Firefox seems to guess what's going on and shows the quotes correctly. More to the point, SimpleXML trusts the declared encoding and converts the 0x93 to UTF-8, so it comes out as the two bytes 0xC2 0x93, which is the control character U+0093 rather than a quote. That hides the original problem and makes it worse.
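You can reproduce the two readings of that byte without the feed at all. This sketch uses iconv rather than SimpleXML, on the assumption that your iconv build knows the name "Windows-1252":

```php
<?php
$byte = "\x93"; // the byte the feed uses for its left quote

// What a parser produces when it trusts the ISO-8859-1 declaration:
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

// What the feed's author actually meant:
$asCp1252 = iconv('Windows-1252', 'UTF-8', $byte);

echo bin2hex($asLatin1), "\n"; // c293   (U+0093, a control character)
echo bin2hex($asCp1252), "\n"; // e2809c (U+201C, left double quotation mark)
```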

EDIT 2

A workaround to get that feed to read a bit more correctly is to replace "ISO-8859-1" with "Windows-1252" in the feed's XML declaration before passing it to SimpleXML. It won't work 100%, because it turns out that some parts of the feed are in UTF-8.
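One way that replacement might look is below. The helper name and the sample document are mine, and the sketch assumes the libxml behind your PHP can decode Windows-1252 (it normally can when built with iconv). It only rewrites the XML declaration, so any "ISO-8859-1" appearing in the feed's text is left alone.

```php
<?php
// Hypothetical helper: relabel a feed that declares ISO-8859-1 as Windows-1252.
function relabelAsCp1252(string $xml): string
{
    // Touch only the encoding attribute of the XML declaration at the start.
    return preg_replace(
        '/^(<\?xml[^>]*?encoding=["\'])ISO-8859-1(["\'])/i',
        '$1Windows-1252$2',
        $xml,
        1
    );
}

// Made-up sample: cp1252 curly quotes (0x93, 0x94) under an ISO-8859-1 label.
$raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
     . "<rss><channel><title>\x93Hello\x94</title></channel></rss>";

$feed = simplexml_load_string(relabelAsCp1252($raw));

// Inspect the bytes SimpleXML hands back for the title.
echo bin2hex((string) $feed->channel->title), "\n";
```

With the relabelling in place, the quotes should come back as the three-byte UTF-8 sequences for U+201C and U+201D instead of the two-byte control characters.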

The general approach, assuming that you can't get everyone in the world to correct their feeds, is to confine whatever workarounds you require to the interface with the external system that's emitting the malformed data, and to pass clean UTF-8 into the hub of your system. Save a dated copy of the raw external feed so that you can remember in future why the workaround was required. Separate off and comment the lines of code that implement the workaround, so that they're easy to find and change if and when the external organisation corrects its feed (or breaks it in a different way), and check the feed again from time to time. Unfortunately, instead of programming to a spec you're programming to the current state of a bug, so there's no permanent, clean solution. The best you can do is isolate, document, and monitor.
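As a rough illustration of that shape, the sketch below archives the raw bytes and applies per-feed workarounds in one place at the boundary. Everything here is hypothetical: the function name, the archive file naming, and the idea of keying workarounds by feed URL are my assumptions, not a prescribed design.

```php
<?php
// Hypothetical boundary function: archive exactly what the external system
// sent, apply any workaround registered for that feed, and return the XML
// that the rest of the application is allowed to see.
function normaliseFeed(string $feedUrl, string $rawXml, string $archiveDir, array $workarounds): string
{
    // Dated copy of the untouched bytes, so the reason for a workaround
    // can be checked later.
    $name = date('Y-m-d') . '_' . md5($feedUrl) . '.xml';
    file_put_contents($archiveDir . DIRECTORY_SEPARATOR . $name, $rawXml);

    // Feeds without a registered workaround pass through unchanged.
    if (isset($workarounds[$feedUrl])) {
        $rawXml = $workarounds[$feedUrl]($rawXml);
    }

    return $rawXml;
}

// All workarounds live in this one table. Note beside each entry when you
// last checked the feed, and delete the entry once the feed is fixed.
$workarounds = [
    // WORKAROUND: declares ISO-8859-1 but sends cp1252 bytes such as 0x93.
    'http://hangout.altsounds.com/external.php?type=RSS2' => function (string $xml): string {
        return str_ireplace('encoding="ISO-8859-1"', 'encoding="Windows-1252"', $xml);
    },
];
```

The parsing code then only ever calls `normaliseFeed()` and never needs to know which feeds are broken or how.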