PHP: UTF 8 characters encoding

Your page is being served as UTF-8 so I'd point my finger at the database.

Make sure the connection is in UTF-8 before any SELECTs or INSERTS - in MySQL:

SET NAMES "utf8"

php utf-8 curl simplexml

Just a quick note about CURLOPT_ENCODING : it's the Accept-Encoding header, which is not the same at all as character encoding. Supported accept encodings are "identity", "deflate", and "gzip".

php utf-8 curl simplexml

Like all debugging, you start by isolating the problem:

I am scraping a list of RSS feeds by using cURL, - look at the xml from the RSS feed that's giving the problem (there's more than one feed, so it's possible for some feeds to be right and for the feeds that are wrong to be wrong in different ways)

and then I am reading and parsing the RSS data with SimpleXML. - print out the field that SimpleXML read out - is it ok or does a problem show up?

The sorted data is then inserted into a mySQL database. - print out hex(field), length(field), and char_length(field) for the piece of data that's giving the problem.

EDIT

Take the feed http://hangout.altsounds.com/external.php?type=RSS2 , put it into the validator http://validator.w3.org/feed/ . They're declaring their content type as iso-8859-1 but some of the actual content, such as the quotes, is in something like cp1252 - for example they're using the byte 0x93 to represent the left quote - http://www.fileformat.info/info/unicode/char/201C/charset_support.htm .

What's annoying about this is that this doesn't show up in some tools - Firefox seems to guess what's going on and show the quotes correctly, and more to the point, SimpleXML converts the 0x93 into utf8, so it comes out as 0xc293, which exacerbates the problem.

EDIT 2

A workaround to get that feed to read a bit more correctly is to replace "ISO-8859-1" by "Windows-1252" before passing to Simple XML. It won't work 100% because it turns out that some parts of the feed are in UTF8.

The general approach, assuming that you can't get everyone in the world to correct their feeds, is to isolate whatever workarounds you require to the interface with the external system that's emitting the malformed data, and to pass in pure clear utf8 to the hub of your system. Save a dated copy of the raw external feed so you can remember in future why the workaround was required, separate off and comment the code lines that implement the workaround so it's easy to get at and change if and when the external organisation corrects its feed (or breaks it in a different way), and check it again from time to time. Unfortunately instead of programming to a spec you're programming to the current state of a bug, so there's no permanent, clean solution - the best you can do is isolate, document, and monitor.

CodeHunter

PHP: UTF 8 characters encoding

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last