How to skip invalid characters in XML file using PHP


Do you have control over the XML? If so, ensure the data is enclosed in <![CDATA[ .. ]]> blocks.
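As a rough illustration of the CDATA suggestion, the sketch below wraps a string in a CDATA section. The helper name wrap_in_cdata is made up for this example. The one sequence a CDATA section cannot contain is ]]>, so the sketch splits it across two sections. Keep in mind that CDATA only stops markup characters such as < and & from being parsed; characters that are illegal in XML stay illegal inside CDATA.

```php
<?php
// Hypothetical helper: wrap text in a CDATA section.
// A literal "]]>" would end the section early, so it is split
// across two adjacent CDATA sections.
function wrap_in_cdata(string $text): string
{
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);
    return '<![CDATA[' . $safe . ']]>';
}

echo wrap_in_cdata('1 < 2 && 3 > 2'), "\n";
```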

And you also need to clear the invalid characters:

/**
 * Removes invalid XML
 *
 * @access public
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
    $ret = "";
    if ($value === null || $value === "")
    {
        return $ret;
    }

    $length = strlen($value);
    for ($i = 0; $i < $length; $i++)
    {
        // ord() returns a single byte (0-255), so the ranges above 0xFF
        // never match here. In practice this loop replaces control bytes
        // other than tab, LF and CR with a space and passes every other
        // byte through unchanged.
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}
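The range checks in stripInvalidXml() describe Unicode code points, but strlen() and ord() work on bytes. A minimal sketch of the same check done per code point might look like the following. It assumes the mbstring extension, PHP 7.4 or later (for mb_str_split()), and UTF-8 input; the function name strip_invalid_xml_chars is invented for this example.

```php
<?php
// Hypothetical code-point-aware variant of stripInvalidXml().
// Assumes mbstring and UTF-8 input.
function strip_invalid_xml_chars(string $value): string
{
    $ret = '';
    foreach (mb_str_split($value, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8'); // false if the character cannot be decoded
        $valid = $cp !== false && (
            $cp === 0x9 || $cp === 0xA || $cp === 0xD
            || ($cp >= 0x20 && $cp <= 0xD7FF)
            || ($cp >= 0xE000 && $cp <= 0xFFFD)
            || ($cp >= 0x10000 && $cp <= 0x10FFFF)
        );
        // Same policy as the original: invalid characters become a space.
        $ret .= $valid ? $char : ' ';
    }
    return $ret;
}
```

Unlike the byte-wise loop, this also catches code points such as U+FFFE, which are encoded as several bytes that each look harmless on their own.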


I decided to test every Unicode code point (0-1114111) to make sure things work as they should. When run over all of those values, preg_replace() hits errors and returns NULL. This is the solution I've come up with.

$utf_8_range = range(0, 1114111);
$output = ords_to_utfstring($utf_8_range);
$sanitized = sanitize_for_xml($output);

/**
 * Removes invalid XML
 *
 * @access public
 * @param string $input
 * @return string
 */
function sanitize_for_xml($input) {
  // Convert input to UTF-8, dropping characters that cannot be converted.
  $old_setting = ini_set('mbstring.substitute_character', 'none');
  $input = mb_convert_encoding($input, 'UTF-8', 'auto');
  ini_set('mbstring.substitute_character', $old_setting);

  // Use fast preg_replace. If failure, use slower chr => int => chr conversion.
  // The character class lists every range XML 1.0 allows, including
  // U+10000-U+10FFFF, so it agrees with the fallback below.
  $output = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $input);
  if (is_null($output)) {
    // Convert to ints.
    // Convert ints back into a string.
    $output = ords_to_utfstring(utfstring_to_ords($input), TRUE);
  }
  return $output;
}

/**
 * Given a UTF-8 string, output an array of ordinal values.
 *
 * @param string $input
 *   UTF-8 string.
 * @param string $encoding
 *   Defaults to UTF-8.
 *
 * @return array
 *   Array of ordinal values representing the input string.
 */
function utfstring_to_ords($input, $encoding = 'UTF-8') {
  // Turn a string of unicode characters into UCS-4BE, which is a Unicode
  // encoding that stores each character as a 4 byte integer. This accounts for
  // the "UCS-4"; the "BE" suffix indicates that the integers are stored in
  // big-endian order. The reason for this encoding is that each character is a
  // fixed size, making iterating over the string simpler.
  $input = mb_convert_encoding($input, "UCS-4BE", $encoding);

  // Visit each unicode character.
  $ords = array();
  $length = mb_strlen($input, "UCS-4BE");
  for ($i = 0; $i < $length; $i++) {
    // Now we have 4 bytes. Find their total numeric value.
    $s2 = mb_substr($input, $i, 1, "UCS-4BE");
    $val = unpack("N", $s2);
    $ords[] = $val[1];
  }
  return $ords;
}

/**
 * Given an array of ints representing Unicode chars, outputs a UTF-8 string.
 *
 * @param array $ords
 *   Array of integers representing Unicode characters.
 * @param bool $scrub_XML
 *   Set to TRUE to remove non valid XML characters.
 *
 * @return string
 *   UTF-8 String.
 */
function ords_to_utfstring($ords, $scrub_XML = FALSE) {
  $output = '';
  foreach ($ords as $ord) {
    // 0: Negative numbers.
    // 55296 - 57343: Surrogate Range.
    // 65279: BOM (byte order mark).
    // 1114111: Out of range.
    if (   $ord < 0
        || ($ord >= 0xD800 && $ord <= 0xDFFF)
        || $ord == 0xFEFF
        || $ord > 0x10ffff) {
      // Skip non valid UTF-8 values.
      continue;
    }
    // 9: Anything Below 9.
    // 11: Vertical Tab.
    // 12: Form Feed.
    // 14-31: Unprintable control codes.
    // 65534, 65535: Unicode noncharacters.
    elseif ($scrub_XML && (
               $ord < 0x9
            || $ord == 0xB
            || $ord == 0xC
            || ($ord > 0xD && $ord < 0x20)
            || $ord == 0xFFFE
            || $ord == 0xFFFF
            )) {
      // Skip non valid XML values.
      continue;
    }
    // 127: 1 Byte char.
    elseif ($ord <= 0x007f) {
      $output .= chr($ord);
      continue;
    }
    // 2047: 2 Byte char.
    elseif ($ord <= 0x07ff) {
      $output .= chr(0xc0 | ($ord >> 6));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 65535: 3 Byte char.
    elseif ($ord <= 0xffff) {
      $output .= chr(0xe0 | ($ord >> 12));
      $output .= chr(0x80 | (($ord >> 6) & 0x003f));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 1114111: 4 Byte char.
    elseif ($ord <= 0x10ffff) {
      $output .= chr(0xf0 | ($ord >> 18));
      $output .= chr(0x80 | (($ord >> 12) & 0x3f));
      $output .= chr(0x80 | (($ord >> 6) & 0x3f));
      $output .= chr(0x80 | ($ord & 0x3f));
      continue;
    }
  }
  return $output;
}
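The NULL return from preg_replace() is easy to reproduce in isolation. With the /u modifier, PCRE refuses a subject that is not well-formed UTF-8, and preg_last_error() reports why. The snippet below is only a small demonstration of that behaviour; the sample string is made up.

```php
<?php
// "\xFF" can never appear in well-formed UTF-8.
$broken = "abc\xFFdef";

// With /u the subject is validated as UTF-8 before matching.
$result = preg_replace('/[^\x{0020}-\x{D7FF}]+/u', '', $broken);

var_dump($result === null);
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

This is why the is_null() check and a slower fallback are needed whenever the input cannot be trusted to be valid UTF-8.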

And to do this on a simple object or array:

// Recursive sanitize_for_xml.
function recursive_sanitize_for_xml(&$input) {
  if (is_null($input) || is_bool($input) || is_numeric($input)) {
    return;
  }
  if (!is_array($input) && !is_object($input)) {
    $input = sanitize_for_xml($input);
  }
  else {
    foreach ($input as &$value) {
      recursive_sanitize_for_xml($value);
    }
  }
}
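For plain nested arrays, PHP's built-in array_walk_recursive() can do the traversal instead of a hand-written recursion. The sketch below is self-contained, so it uses a small stand-in cleaner (strip_xml_invalid, a hypothetical name) rather than sanitize_for_xml(), and it assumes the strings are already valid UTF-8. Unlike the function above, it does not descend into objects, and it also cleans numeric strings.

```php
<?php
// Hypothetical stand-in for sanitize_for_xml(); assumes valid UTF-8 input.
function strip_xml_invalid(string $text): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        '',
        $text
    );
    // preg_replace() returns null on invalid UTF-8; fall back to an empty string.
    return $clean ?? '';
}

// Clean every string leaf of a nested array in place.
function sanitize_array(array &$data): void
{
    array_walk_recursive($data, function (&$value): void {
        if (is_string($value)) {
            $value = strip_xml_invalid($value);
        }
    });
}

$record = [
    'title' => "Hel\x01lo",
    'count' => 3,
    'tags'  => ["php\x0B", 'xml'],
];
sanitize_array($record);
```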


If you have control over the data, ensure that it is encoded correctly, i.e. that it is in the encoding you promised in the XML declaration. For example, if you have:

<?xml version="1.0" encoding="UTF-8"?>

then you'll need to ensure your data is in UTF-8.
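One way to enforce this before writing the XML is to check the data and convert it only when the check fails. This is a sketch under assumptions: it needs the mbstring extension, the helper name ensure_utf8 is invented, and the fallback encoding (ISO-8859-1 here) is a guess you would replace with whatever your data source really uses.

```php
<?php
// Hypothetical helper: return $data as UTF-8.
// $assumedEncoding is what we guess the data is in when it is not
// already valid UTF-8; the default is an assumption, not a detection.
function ensure_utf8(string $data, string $assumedEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid, leave untouched
    }
    return mb_convert_encoding($data, 'UTF-8', $assumedEncoding);
}

// "caf\xE9" is "café" in ISO-8859-1 and is not valid UTF-8.
echo ensure_utf8("caf\xE9"), "\n";
```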

If you don't have control over the data, yell at those who do.

You can use a tool like xmllint to check which part(s) of the data are not valid.
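On the command line that is typically something like xmllint --noout file.xml, which prints the position of each problem. If you would rather do a similar check from PHP, libxml (the parser library that xmllint belongs to) can report the same kind of information. The following is a sketch, assuming the SimpleXML and libxml extensions are available; the function name xml_errors is made up.

```php
<?php
// Hypothetical helper: parse a string and return libxml's error messages
// with their line and column, or an empty array if the XML parsed cleanly.
function xml_errors(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // collect instead of emitting warnings
    libxml_clear_errors();

    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting
    return $messages;
}

// A control character (0x01) inside an element is not allowed in XML 1.0.
print_r(xml_errors("<a>\x01</a>"));
```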