How to extract links and titles from a .html page? (PHP)



Thank you everyone, I GOT IT!

The final code:

    $html = file_get_contents('bookmarks.html');

    // Create a new DOM document
    $dom = new DOMDocument;

    // Parse the HTML. The @ is used to suppress any parsing errors
    // that will be thrown if the $html string isn't valid XHTML.
    @$dom->loadHTML($html);

    // Get all links. You could also use any other tag name here,
    // like 'img' or 'table', to extract other tags.
    $links = $dom->getElementsByTagName('a');

    // Iterate over the extracted links and display their URLs
    foreach ($links as $link) {
        // Extract and show the "href" attribute.
        echo $link->nodeValue;
        echo $link->getAttribute('href'), '<br>';
    }

This prints the anchor text and the href of every link in the .html file.
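If you would rather collect the results than echo them, something along these lines might work. It is only a sketch: `extractLinks` is a made-up helper name, and it swaps the `@` operator for `libxml_use_internal_errors()`, which is one common way to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not part of the code above): returns each link's
// anchor text and href as an array instead of echoing them.
function extractLinks(string $html): array
{
    // Keep libxml parse warnings internal instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = [
            'text' => trim($a->textContent),
            'href' => $a->getAttribute('href'),
        ];
    }
    return $links;
}

// Sample input for illustration; for the real file, pass
// file_get_contents('bookmarks.html') instead.
$sample = '<html><body><a href="http://clicklink.com">here</a></body></html>';
foreach (extractLinks($sample) as $link) {
    echo $link['text'], ': ', $link['href'], "\n";
}
```

Returning an array keeps the parsing separate from the output, so the same data can be printed, filtered, or stored.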

Again, thanks a lot.


This is probably sufficient:

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('a') as $node) {
        echo $node->nodeValue.': '.$node->getAttribute("href")."\n";
    }


Here is an example. In your case, you can load the page like this:

    $content = file_get_contents('bookmarks.html');

Then run this (the sample below uses a hard-coded string for $content):

    <?php
    $content = '<html><title>Random Website I am Crawling</title><body>Click <a href="http://clicklink.com">here</a> for foobarAnother site is http://foobar.com</body></html>';

    $regex = "((https?|ftp)\:\/\/)?"; // SCHEME
    $regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
    $regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
    $regex .= "(\:[0-9]{2,5})?"; // Port
    $regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
    $regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
    $regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor

    $matches = array(); // create array
    $pattern = "/$regex/";
    preg_match_all($pattern, $content, $matches);

    print_r(array_values(array_unique($matches[0])));
    echo "<br><br>";
    echo implode("<br>", array_values(array_unique($matches[0])));

Output:

    Array
    (
        [0] => http://clicklink.com
        [1] => http://foobar.com
    )

http://clicklink.com

http://foobar.com
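Note that the pattern above returns URLs only, including ones that appear as plain text, and not the link titles. If you want to stay with regular expressions and still get the titles, a rough sketch could look like the following. The pattern here is my own assumption, not part of the answer above: it expects quoted href values and non-nested anchors, so it is more fragile than the DOMDocument approach.

```php
<?php
// Same sample input as in the example above.
$content = '<html><title>Random Website I am Crawling</title><body>Click <a href="http://clicklink.com">here</a> for foobarAnother site is http://foobar.com</body></html>';

// Assumed pattern: capture the quoted href value and the text between
// <a ...> and </a>. It will miss unquoted hrefs and unusual markup.
$pattern = '/<a\s[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/is';

$pairs = [];
if (preg_match_all($pattern, $content, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $pairs[] = [
            'href'  => $m[1],
            // Strip any inline tags (e.g. <b>) from the anchor text.
            'title' => trim(strip_tags($m[2])),
        ];
    }
}

print_r($pairs);
```

With the sample string this yields a single pair (http://clicklink.com with the title "here"); the bare http://foobar.com is skipped because it is not inside an anchor tag.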