DOMDocument::loadHTML(): input conversion failed due to input error


I came across a solution today:

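$html = new DOMDocument();
// get_html() is the poster's own function that returns the raw page markup
$html_source = get_html();
// Encode every non-ASCII character as an HTML entity (input assumed to be UTF-8),
// so loadHTML() no longer has to guess the character set.
// Note: the "HTML-ENTITIES" target is deprecated as of PHP 8.2.
$html_source = mb_convert_encoding($html_source, "HTML-ENTITIES", "UTF-8");
$html->loadHTML($html_source);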


Without seeing the full head of the document that you are parsing I can only guess, but if the <meta> tag with the character encoding data does not come directly after the opening <head> tag, you may be running into a situation where DomDocument is using its default of ISO-8859-1 and running into the 【 character (the first three "invalid" bytes in gb2312). Of those, the 0x80 byte would be the first bit of nonsense, since this is an unused code point in ISO-8859-1. This would likely trigger the bug in DomDocument discussed in the comments above, and it could easily happen if an element such as <title> is included before the content-type meta information.
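To check whether a page is in the situation described above, it can help to compare byte offsets. The sketch below is only a rough diagnostic: the function name non_ascii_precedes_charset() is made up for illustration, and it assumes the charset is declared with a literal "charset=" somewhere in the markup.

```php
<?php
// Hypothetical helper (not part of DOMDocument): returns true when a byte
// >= 0x80 appears before the first "charset=" declaration, or when the
// markup contains non-ASCII bytes and no declaration at all.
function non_ascii_precedes_charset(string $html): bool
{
    // Byte offset of the first non-ASCII byte, if any (byte-wise match, no /u flag).
    if (!preg_match('/[\x80-\xFF]/', $html, $m, PREG_OFFSET_CAPTURE)) {
        return false; // pure ASCII, nothing for the parser to misread
    }
    $firstNonAscii = $m[0][1];

    // Byte offset of the first charset declaration, if any.
    $charsetPos = stripos($html, 'charset=');

    return $charsetPos === false || $firstNonAscii < $charsetPos;
}
```

If this returns true for the fetched page, the parser has already met multi-byte content by the time it learns the real encoding, which matches the guess above.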

The only thing I can think of to try would be to run the HTML through a bit of prep and move that content-type meta tag to right after the opening <head> tag, so that DomDocument uses the correct character set. If you use mb_convert_encoding or iconv to convert the encoding to ISO-8859-1 or UTF-8, make sure that you also modify the meta information to match, because DomDocument is, unfortunately, brittle in many ways.
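The prep step could look roughly like the sketch below. The function name move_charset_meta_first() is hypothetical, and the regular expressions are deliberately simple; they assume reasonably ordinary markup and are not a general HTML rewriter.

```php
<?php
// Hypothetical helper: drop any existing charset <meta> tags and put a single
// content-type declaration directly after the opening <head> tag.
function move_charset_meta_first(string $html, string $charset = 'UTF-8'): string
{
    // Remove every <meta> tag that mentions a charset.
    $html = preg_replace('/<meta\b[^>]*charset[^>]*>/i', '', $html);

    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">';

    // Insert right after <head ...>; the limit of 1 touches only the first match.
    $result = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $html, 1, $count);

    // No <head> found: prepend the declaration so the parser still sees it early.
    return $count > 0 ? $result : $meta . $html;
}
```

The charset passed in has to match the actual bytes, so a gb2312 page would first be converted (for example with iconv or mb_convert_encoding) and only then passed through move_charset_meta_first($html_source, 'UTF-8') before calling loadHTML().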


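<?php
$contents = file_get_contents('xml.xml');

// Return $string as UTF-8; input that is not valid UTF-8 is assumed to be ISO-8859-1.
function convert_utf8($string) {
    if (!mb_check_encoding($string, 'UTF-8')) {
        // $string is not UTF-8
        return iconv("ISO-8859-1", "UTF-8", $string);
    } else {
        // already UTF-8
        return $string;
    }
}

// mb_convert_encoding() takes the target encoding first, then the source encoding.
// Strict detection against an explicit candidate list avoids a false result.
$detected = mb_detect_encoding($contents, ['UTF-8', 'ISO-8859-1'], true);
$contents = mb_convert_encoding($contents, "UTF-8", $detected);
$xml = simplexml_load_string(convert_utf8($contents));
print_r($xml);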