
Error PHP website crawler class using Simple HTML Dom


I have found the bug.

In my (limited) tests, the problem occurs when you set depth > 1, that is, judging from your code, when you load more than one page URL. One of the many problems with Simple HTML DOM is that the ->load() method doesn't work correctly when it is called repeatedly on the same object.

If you re-instantiate the html object before each load, the script seems to work:

    protected function _processAnchors( $content, $url, $depth ){
        $this->html = new simple_html_dom();    # <-----
        $this->html->load( $content );

I also tested $this->html = str_get_html($content); but it worked on only some of the sites.

Additional note: the <title> tag is mandatory in HTML, but not all sites have well-formed HTML. Consider checking that the <title> tag (and any other tag you read) exists before using it, to avoid additional errors.
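
The existence check could look something like the sketch below. firstText() is a hypothetical helper, not part of Simple HTML DOM or of your class. It only assumes the parser object has a find($selector, $index) method that returns null when nothing matches, and that nodes expose a plaintext property, which is how Simple HTML DOM behaves as far as I know.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns the trimmed text
// of the first element matching $selector, or $default when no such element exists.
function firstText($dom, string $selector, ?string $default = null): ?string
{
    // Assumption: find($selector, 0) yields the first match, or null if there is none.
    $node = $dom->find($selector, 0);
    if ($node === null) {
        return $default;
    }
    return trim($node->plaintext);
}

// Possible use inside the crawler, after $this->html has been loaded:
// $title = firstText($this->html, 'title', '(no title)');
```

With this, a page without a <title> gives you the default value instead of an error from reading a property on null.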