Extracting site data through a web crawler outputs an error due to an array index mismatch

Tags: php


Instead of writing your own parser, you could use an existing one such as Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html

    use Symfony\Component\DomCrawler\Crawler;

    $crawler = new Crawler($returned_content);
    $linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
        return $node->text();
    });

Or, if you want to traverse the DOM tree yourself, you can use DOMDocument's loadHTML: http://php.net/manual/en/domdocument.loadhtml.php

    $document = new DOMDocument();
    $document->loadHTML($returned_content);
    foreach ($document->getElementsByTagName('a') as $link) {
        $text = $link->nodeValue;
    }

EDIT:

To get the links you want, you can use the following code. It assumes you have a $returned_content variable containing the HTML you want to parse.

    // create a new instance of DOMDocument (DOM = Document Object Model)
    $domDocument = new DOMDocument();

    // save the previous libxml error reporting and switch to internal errors
    // to be able to parse an HTML document that is not well formed
    $previousErrorReporting = libxml_use_internal_errors(true);
    $domDocument->loadHTML($returned_content);
    libxml_use_internal_errors($previousErrorReporting);

    $links = [];

    /** @var DOMElement $node */
    // get all <a> elements from the HTML
    foreach ($domDocument->getElementsByTagName('a') as $node) {
        $parentNode = $node->parentNode;

        // check if the <a> is under a <td> that has class="FootNotes2"
        $isChildOfAFootNotesTd = $parentNode->nodeName === 'td'
            && $parentNode->getAttribute('class') === 'FootNotes2';

        // check if the <a> has class="Links2"
        $isLinkOfLink2Class = $node->getAttribute('class') === 'Links2';

        // as I assumed you wanted links from the <td>, this check makes sure
        // that both of the above conditions are fulfilled
        if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
            $links[] = [
                'href' => $node->getAttribute('href'),
                'text' => $parentNode->textContent,
            ];
        }
    }

    print_r($links);

This will produce an array similar to:

    Array
    (
        [0] => Array
            (
                [href] => /files/forum/2017/1/837242.php
                [text] => Q@Q Drill Time ① - cardio69
            )

        [1] => Array
            (
                [href] => /files/forum/2017/1/837356.php
                [text] => study partner in Houston - lacy
            )

        [2] => Array
            (
                [href] => /files/forum/2017/1/837110.php
                [text] => Serious dedicated study partner for U World - step12013
            )

        ...
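As a variation on the loop above, the two class checks can probably be folded into a single XPath expression with DOMXPath. This is only a sketch: the sample markup and the /thread-N.php paths are made up to stand in for $returned_content, and the expression assumes each <a> is a direct child of the <td> and that the class attributes contain exactly one class name.

```php
<?php
// Hypothetical sample markup standing in for $returned_content.
$returned_content = <<<HTML
<table>
  <tr><td class="FootNotes2"><a class="Links2" href="/thread-1.php">First thread</a> - alice</td></tr>
  <tr><td class="Other"><a class="Links2" href="/skip.php">Skipped</a></td></tr>
  <tr><td class="FootNotes2"><a class="Links2" href="/thread-2.php">Second thread</a> - bob</td></tr>
</table>
HTML;

$domDocument = new DOMDocument();

// Collect libxml warnings internally so imperfect HTML does not emit PHP warnings.
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_clear_errors();
libxml_use_internal_errors($previousErrorReporting);

// Select only <a class="Links2"> elements directly under <td class="FootNotes2">.
$xpath = new DOMXPath($domDocument);
$links = [];
foreach ($xpath->query('//td[@class="FootNotes2"]/a[@class="Links2"]') as $node) {
    $links[] = [
        'href' => $node->getAttribute('href'),
        'text' => trim($node->parentNode->textContent),
    ];
}

print_r($links);
```

If no element matches, the loop simply does not run and $links stays empty, so there is no index to get wrong.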


Using the Simple HTML DOM Parser library, you can use the following code:

    <?php
        require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.

        $html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');

        // find all <a> elements inside a <td class="FootNotes2"> element
        foreach ($html->find('td.FootNotes2 a') as $element) {
            // you can also access only certain attributes of the elements (e.g. the URL).
            $element->href = "http://www.usmleforum.com" . $element->href;
            echo $element . '<br>'; // do something with the elements.
        }
    ?>
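The code above prepends the host to every href, which would double the host if a link were already absolute. A small guard could look like the sketch below. Note that absolutize_href is a hypothetical helper, not part of Simple HTML DOM Parser, and it only handles absolute and root-relative hrefs (not "../" paths or protocol-relative "//host" links).

```php
<?php
/**
 * Hypothetical helper: return $href unchanged if it already has a scheme,
 * otherwise join it onto $base with exactly one slash in between.
 */
function absolutize_href(string $href, string $base): string
{
    // parse_url() returns a string for the scheme only when one is present.
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }

    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

// Root-relative link: the base host is added.
echo absolutize_href('/files/forum/2017/1/837242.php', 'http://www.usmleforum.com'), PHP_EOL;

// Already absolute link: returned as-is.
echo absolutize_href('https://example.com/page.php', 'http://www.usmleforum.com'), PHP_EOL;
```

Inside the foreach loop you could then write $element->href = absolutize_href($element->href, 'http://www.usmleforum.com');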


I tried the same code on another site, and it works. Please take a look at it:

    <?php
        function get_data($url) {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_URL, $url);
            $result = curl_exec($ch);
            curl_close($ch);
            return $result;
        }

        $returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');

        $first_step = explode('<tbody id="threadbits_forum_26">', $returned_content);
        $second_step = explode('</tbody>', $first_step[1]);
        $third_step = explode('<tr>', $second_step[0]);
        // print_r($third_step);

        foreach ($third_step as $element) {
            $child_first = explode('<td class="alt1"', $element);
            $child_second = explode('</td>', $child_first[1]);
            $child_third = explode('<a href=', $child_second[0]);
            $child_fourth = explode('</a>', $child_third[1]);
            echo $final = "<a href=" . $child_fourth[0] . "</a><br>";
        }
    ?>

I know it's a lot to ask, but could you please combine these two into one piece of code that makes the crawler work?

@jkmak
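One possible way to combine the two, as a rough sketch rather than a tested crawler for that site: fetch the page with cURL as before, then let DOMDocument and DOMXPath pick out the links instead of chaining explode() calls. explode() returns a one-element array when the delimiter is not found, so reading index [1] fails with an undefined offset; an XPath query that matches nothing just yields an empty result. The function names fetch_html and extract_links are made up for this example, and the sample markup is hypothetical.

```php
<?php
// Fetch a page with cURL. Returns null on failure instead of false.
function fetch_html(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $result = curl_exec($ch);

    return is_string($result) ? $result : null;
}

// Return href/text pairs for every element matched by the XPath expression.
function extract_links(string $html, string $query): array
{
    // loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $document = new DOMDocument();
    $previousErrorReporting = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previousErrorReporting);

    $nodes = (new DOMXPath($document))->query($query);
    if ($nodes === false) {
        return []; // invalid XPath expression
    }

    $links = [];
    foreach ($nodes as $node) {
        if ($node instanceof DOMElement) {
            $links[] = [
                'href' => $node->getAttribute('href'),
                'text' => trim($node->textContent),
            ];
        }
    }

    return $links;
}

// Offline demonstration with hypothetical markup.
$sample = '<table><tr><td class="FootNotes2">'
    . '<a class="Links2" href="/thread-1.php">First thread</a></td></tr></table>';
print_r(extract_links($sample, '//td[@class="FootNotes2"]/a[@class="Links2"]'));

// A page without the expected structure gives an empty array, not an offset error.
print_r(extract_links('<p>no links here</p>', '//td[@class="FootNotes2"]/a'));
```

For the live page you would presumably call something like extract_links(fetch_html('http://www.usmleforum.com/forum/index.php?forum=1') ?? '', '//td[@class="FootNotes2"]/a[@class="Links2"]'), adjusting the expression if the markup differs from what the answers above assume.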