Extracting site data with a web crawler fails with an array index mismatch error
Instead of writing your own parser, you could use an existing one such as Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html
$crawler = new Crawler($returned_content);
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
    return $node->text();
});
Or, if you want to traverse the DOM tree yourself, you can use DOMDocument's loadHTML: http://php.net/manual/en/domdocument.loadhtml.php
$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
    $text = $link->nodeValue;
}
EDIT:
To get the links you want, you can use the code below. It assumes you have a $returned_content variable containing the HTML you want to parse.
// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();

// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);

$links = [];

/** @var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
    $parentNode = $node->parentNode;

    // checking if the <a> is under a <td> that has class="FootNotes2"
    $isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';

    // checking if the <a> has class="Links2"
    $isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';

    // as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
    if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => $parentNode->textContent,
        ];
    }
}

print_r($links);
This will give you an array similar to:
Array
(
    [0] => Array
        (
            [href] => /files/forum/2017/1/837242.php
            [text] => Q@Q Drill Time ① - cardio69
        )

    [1] => Array
        (
            [href] => /files/forum/2017/1/837356.php
            [text] => study partner in Houston - lacy
        )

    [2] => Array
        (
            [href] => /files/forum/2017/1/837110.php
            [text] => Serious dedicated study partner for U World - step12013
        )

    ...
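As an alternative, the same filter can probably be expressed as a single XPath query with PHP's built-in DOMXPath. This is only a sketch: the sample markup and the function name extractFootNoteLinks are made up for illustration, and the query assumes the <a> is a direct child of the <td> and that each class attribute holds exactly one class name, as in the loop above.

```php
<?php
// Hypothetical sample markup standing in for $returned_content.
$returned_content = '<table><tr>'
    . '<td class="FootNotes2"><a class="Links2" href="/files/a.php">First topic</a> - alice</td>'
    . '<td class="Other"><a class="Links2" href="/files/b.php">Ignored</a></td>'
    . '</tr></table>';

// Hypothetical helper: collects href and cell text for matching links.
function extractFootNoteLinks(string $html): array
{
    $document = new DOMDocument();

    // Collect libxml warnings internally so imperfect HTML still parses quietly.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $links = [];

    // Select <a class="Links2"> elements that are direct children of <td class="FootNotes2">.
    foreach ($xpath->query('//td[@class="FootNotes2"]/a[@class="Links2"]') as $node) {
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => trim($node->parentNode->textContent),
        ];
    }

    return $links;
}

$links = extractFootNoteLinks($returned_content);
print_r($links);
```

Unlike the loop above, this version trims the cell text; drop the trim() call if the surrounding whitespace matters.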
Using the Simple HTML DOM Parser library, you can use the following code:
<?php
require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.

$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');

// find all <a>-elements inside a <td class="FootNotes2">-element
foreach ($html->find('td.FootNotes2 a') as $element) {
    // you can also access only certain attributes of the elements (e.g. the url).
    $element->href = "http://www.usmleforum.com" . $element->href;

    // do something with the elements.
    echo $element . '<br>';
}
?>
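The snippet above prepends the host to every href, which would produce a broken URL if a link were already absolute. A small guard could look like the following sketch. The helper name absoluteUrl is hypothetical, and it assumes any non-absolute href is relative to the site root, which may not hold for every page.

```php
<?php
// Hypothetical helper: prefix $base only when $href is not already an absolute http(s) URL.
// Assumes relative hrefs are relative to the site root.
function absoluteUrl(string $base, string $href): string
{
    if (preg_match('#^https?://#i', $href) === 1) {
        return $href; // already absolute, leave untouched
    }

    // Join with exactly one slash between host and path.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

$relative = absoluteUrl('http://www.usmleforum.com', '/files/forum/2017/1/837242.php');
$absolute = absoluteUrl('http://www.usmleforum.com', 'http://example.com/page.php');

echo $relative, "\n";
echo $absolute, "\n";
```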
I tried the same code on another site, and it works. Please take a look at it:
<?php
function get_data($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

$returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');

$first_step = explode('<tbody id="threadbits_forum_26">', $returned_content);
$second_step = explode('</tbody>', $first_step[1]);
$third_step = explode('<tr>', $second_step[0]);
// print_r($third_step);

foreach ($third_step as $element) {
    $child_first = explode('<td class="alt1"', $element);
    $child_second = explode('</td>', $child_first[1]);
    $child_third = explode('<a href=', $child_second[0]);
    $child_fourth = explode('</a>', $child_third[1]);
    echo $final = "<a href=".$child_fourth[0]."</a></br>";
}
?>
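One plausible cause of index errors in explode-based code like this: when the delimiter is not found, explode() returns an array with a single element, so reading index 1 fails for any page or row that lacks the expected markup. A guarded version of a single step might look like the sketch below. The helper name textBetween and the sample strings are hypothetical.

```php
<?php
// Hypothetical helper: returns the text between $start and $end,
// or null when either delimiter is missing from $haystack.
function textBetween(string $haystack, string $start, string $end): ?string
{
    $parts = explode($start, $haystack, 2);
    if (count($parts) < 2) {
        return null; // $start not found, so $parts[1] does not exist
    }

    $inner = explode($end, $parts[1], 2);
    if (count($inner) < 2) {
        return null; // $end not found after $start
    }

    return $inner[0];
}

// Made-up rows: one contains the expected cell, the other does not.
$cell = textBetween('<td class="alt1"><a href="/thread-1.html">Topic</a></td>', '<td class="alt1"', '</td>');
$missing = textBetween('<tr><th>Header</th></tr>', '<td class="alt1"', '</td>');

var_dump($cell, $missing);
```

Inside the foreach loop, a null result would be a signal to skip that row with continue instead of indexing further.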
I know it's a lot to ask, but could you please combine these two into code that makes the crawler work?
@jkmak