Error PHP website crawler class using Simple HTML Dom
I have found the bug.
In my (limited) tests, the problem happens when you set depth > 1, that is, judging by your code, when you load more than one page URL. One of the countless Simple HTML DOM problems is that the ->load() method doesn't work correctly on multiple loads.
After re-instantiating the html object, the script seems to work:
    protected function _processAnchors( $content, $url, $depth ){
        $this->html = new simple_html_dom(); # <-----
        $this->html->load( $content );
I also tested $this->html = str_get_html($content); but it works only on some sites.
Additional note: in HTML the <title> tag is mandatory, but not all sites have well-formed HTML. Consider checking that the <title> tag (and any other tag you read) exists to avoid additional errors.
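
A minimal sketch of that existence check. It relies on find($selector, 0) returning null when nothing matches. The helper name firstPlaintext and the '(no title)' fallback are hypothetical, not part of Simple HTML DOM or of your class:

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns the trimmed
// text of the first element matching $selector, or $default if none exists.
function firstPlaintext($dom, string $selector, string $default = ''): string
{
    // With an index argument, find() gives back one node, or null when
    // the page has no matching tag.
    $node = $dom->find($selector, 0);

    // Only touch ->plaintext when a node was actually found.
    return $node === null ? $default : trim($node->plaintext);
}
```

Inside the crawler it could be used as $title = firstPlaintext($this->html, 'title', '(no title)'); so that a page without a <title> yields the fallback instead of an error on a null value.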