How can I strip invalid XML characters from strings in Perl?


The complete regex for removing invalid XML 1.0 characters is:

# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;

For XML 1.1 it is:

# allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
# restricted: [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
$str =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//g;
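
If both versions need to be supported, the two substitutions could be wrapped in one small helper. The following is only a sketch: the name strip_invalid_xml, the %INVALID table and the default of XML 1.0 are assumptions made for illustration, not part of any module.

```perl
use strict;
use warnings;

# Negated character classes transcribed from the regexes above.
my %INVALID = (
    '1.0' => qr/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/,
    '1.1' => qr/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/,
);

# XML 1.1 restricted characters, removed in a second pass as above.
my $RESTRICTED_11 = qr/[\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]/;

# Hypothetical helper: returns a copy of $str without invalid characters.
sub strip_invalid_xml {
    my ($str, $version) = @_;
    $version = '1.0' unless defined $version;   # assumed default
    my $invalid = $INVALID{$version}
        or die "unsupported XML version: $version";
    $str =~ s/$invalid//g;
    $str =~ s/$RESTRICTED_11//g if $version eq '1.1';
    return $str;
}

# NUL and vertical tab are removed, the tab is kept.
print strip_invalid_xml("a\x00b\x0Bc\tD"), "\n";
```

Note that the input is expected to be a decoded character string, not raw bytes read from a file.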


As almost everyone else has said, use a regular expression. It's honestly not complex enough to be worth adding to a library. Preprocess your text with a substitution.

Your comment about linefeeds above suggests that formatting matters to you, so you may have to decide exactly what you want to replace each character with.

The set of valid characters is clearly defined in the XML spec (see http://www.w3.org/TR/REC-xml/#charsets, for example). In the ASCII range, the disallowed characters are the control characters other than carriage return, linefeed and tab, so you are looking at a character class that covers 29 characters. That's not too bad, surely. (Outside ASCII, the spec also excludes the surrogate block and U+FFFE/U+FFFF.)

Something like:

$text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;

should do it.
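
If the formatting matters, the control characters can be replaced with a placeholder instead of being deleted. A minimal sketch, assuming a single space is an acceptable default; replace_controls is a hypothetical name, and the class runs up to \x1F so that all 29 control characters are covered.

```perl
use strict;
use warnings;

# ASCII control characters that XML 1.0 forbids: everything below 0x20
# except tab (0x09), linefeed (0x0A) and carriage return (0x0D).
my $CONTROL = qr/[\x00-\x08\x0B\x0C\x0E-\x1F]/;

# Hypothetical helper; the default placeholder (one space) is an
# arbitrary choice for this sketch.
sub replace_controls {
    my ($text, $placeholder) = @_;
    $placeholder = ' ' unless defined $placeholder;
    $text =~ s/$CONTROL/$placeholder/g;
    return $text;
}

print replace_controls("one\x0Ctwo\nthree"), "\n";       # form feed becomes a space
print replace_controls("one\x0Ctwo\nthree", ''), "\n";   # form feed is deleted
```

Passing an empty string as the placeholder gives the same result as the plain deletion.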


I've found a solution, but it uses the iconv command instead of Perl.

$ iconv -c -f UTF-8 -t UTF-8 invalid.utf8 > valid.utf8
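
For anyone who would rather stay in Perl, something roughly similar to iconv -c can probably be built on the core Encode module. This is a sketch under that assumption, not a tested drop-in replacement; drop_malformed_utf8 is a made-up name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: takes raw bytes and returns a decoded character
# string, skipping every byte that is not part of a well-formed UTF-8
# sequence.
sub drop_malformed_utf8 {
    my ($bytes) = @_;
    my $out = '';
    while (length $bytes) {
        # With FB_QUIET, decode returns what it managed to decode and
        # leaves the unprocessed remainder (starting at the bad byte)
        # in $bytes.
        $out .= Encode::decode('UTF-8', $bytes, Encode::FB_QUIET());
        substr($bytes, 0, 1, '') if length $bytes;   # discard the offending byte
    }
    return $out;
}

my $clean = drop_malformed_utf8("<root>\xA0\xA0</root>");
print Encode::encode('UTF-8', $clean), "\n";
```

The result is a character string, so it has to be encoded again (as in the last line) before it is written to a file.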

The solutions given above based on regular expressions do not work when the input is not valid UTF-8. Consider the following example:

$ perl -e 'print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\x{A0}\x{A0}</root>"' > invalid.xml
$ perl -e 'use XML::Simple; XMLin("invalid.xml")'
invalid.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA0 0xA0 0x3C 0x2F
$ perl -ne 's/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go; print' invalid.xml > valid.xml
$ perl -e 'use XML::Simple; XMLin("valid.xml")'
invalid.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA0 0xA0 0x3C 0x2F

In fact, the two files invalid.xml and valid.xml are identical.

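The thing is that the character class works on characters, not on encodings. Read without decoding, each 0xA0 byte looks like the character U+00A0, which falls inside the allowed range "\x20-\x{D7FF}" and is therefore kept, even though the byte sequence 0xA0 0xA0 is not valid UTF-8.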
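
The mismatch can be reproduced without any XML parser. The snippet below is only an illustration and assumes the data arrives as undecoded bytes, as it does with perl -ne; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $bytes = "<root>\xA0\xA0</root>";   # same payload as in invalid.xml

# Without decoding, every byte counts as one character, and 0xA0 is
# taken to be U+00A0, which lies inside \x20-\x{D7FF}.
my $accepted = () = $bytes =~ /[\x20-\x{D7FF}]/g;
print "accepted by the class: $accepted of ", length($bytes), "\n";

# The same bytes are not well-formed UTF-8: utf8::decode returns false
# and leaves the string unchanged.
my $copy = $bytes;
my $is_utf8 = utf8::decode($copy) ? 1 : 0;
print "well-formed UTF-8: $is_utf8\n";
```

So the substitution has nothing to remove; the malformed bytes have to be dealt with at the encoding level before (or instead of) the character filter.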