Wordpress/Apache - 404 error with unicode characters in image filenames


Unicode normalisation.

0xC3 0xA5 is the UTF-8 encoding for U+00E5 a-with-ring.

0xCC 0x8A is the UTF-8 encoding for U+030A combining ring.

U+00E5 is the composed (Normal Form C) way of writing an a-ring; an a letter followed by U+030A is the decomposed (Normal Form D) way of writing it. å vs å - they should look the same, though they may differ slightly depending on font rendering.
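To see the two forms side by side, here is a minimal Python sketch using the standard `unicodedata` module. The variable names are just for illustration.

```python
import unicodedata

composed = "\u00e5"      # U+00E5: a-with-ring as a single code point (NFC)
decomposed = "a\u030a"   # 'a' followed by U+030A combining ring (NFD)

# The two strings render alike but are different byte sequences in UTF-8.
composed_bytes = composed.encode("utf-8")      # 0xC3 0xA5
decomposed_bytes = decomposed.encode("utf-8")  # 'a' then 0xCC 0x8A

print(composed_bytes, decomposed_bytes)
print(composed == decomposed)  # a plain comparison treats them as different

# Normalising converts one form into the other.
print(unicodedata.normalize("NFD", composed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```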

Now normally it doesn't really matter which one you've got because sensible filesystems leave them untouched. If you save a file called [char U+00E5].txt (å.txt), it stays called that under Windows and Linux.

Macs, on the other hand, are insane. The filesystem prefers Normal Form D, to the extent that any composed characters you pass into it get converted into decomposed ones. If you put a file in called [char U+00E5].txt and immediately list the directory, you'll find you've actually got a file called a[char U+030A].txt. You can still access the file as [char U+00E5].txt on a Mac because it'll convert that input into Normal Form D too before looking it up, but you cannot get back the exact character sequence you put in: it's a lossy conversion.

So if you save your files on a Mac and then transfer to a filesystem where [char U+00E5].txt and a[char U+030A].txt refer to different files, you will get broken links.
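The mismatch shows up directly in the percent-encoded URL the browser requests. A rough sketch, assuming a hypothetical image name `å.jpg` (the filename is made up for the example):

```python
import unicodedata
from urllib.parse import quote

name = "\u00e5.jpg"  # hypothetical filename, written here in composed form

# Percent-encode the UTF-8 bytes of each normal form.
nfc_url = quote(unicodedata.normalize("NFC", name))
nfd_url = quote(unicodedata.normalize("NFD", name))

print(nfc_url)  # %C3%A5.jpg
print(nfd_url)  # a%CC%8A.jpg
```

If the page links to one form and the file on disk is stored under the other, a filesystem that compares names byte for byte won't find it, hence the 404.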

Update the pages to point to the Normal Form D versions of the URLs, or re-upload the files from a filesystem that doesn't egregiously mangle Unicode characters.
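To work out which links need updating, something like the following could list the files whose names are not in Normal Form C. This is only a sketch: `decomposed_names` and `scan` are hypothetical helpers, and the directory you pass to `scan` (for example your uploads folder) is up to you.

```python
import os
import unicodedata

def decomposed_names(names):
    """Return the names that are not already in Normal Form C."""
    return [n for n in names if unicodedata.normalize("NFC", n) != n]

def scan(root):
    """Yield paths under root whose filename is not in Normal Form C."""
    for dirpath, _dirs, files in os.walk(root):
        for name in decomposed_names(files):
            yield os.path.join(dirpath, name)
```

Run `scan` on the server rather than on the Mac, so you see the names as the server's filesystem actually stores them.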

Think Different, Cause Bizarre Interoperability Problems.