How to read a web page in PHP

php web

The easy way: pass the URL to fopen() or file_get_contents(), e.g. fopen("http://google.com/", "r")
The smart way: Use the cURL library
The other smart way: http_get() from PHP's http module (the PECL http extension)
The hard way: Craft an HTTP request and send it with fsockopen() or stream_socket_client()
The C way: Send an HTTP request using sockets
The stupid way: Call an external tool such as wget or curl through system()

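None of these is guaranteed to be available on your server, though: opening URLs with fopen() or file_get_contents() depends on the allow_url_fopen setting, cURL and the http module are optional extensions, and system() is often disabled on shared hosts.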
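
Since availability varies from server to server, it can help to check what is usable before committing to one method. The following is only a minimal sketch under a few assumptions: the function names availableFetchMethods() and fetchPage() are made up for illustration, and the 10-second timeout is an arbitrary choice.

```php
<?php
// Hypothetical helper: report which of the approaches above this PHP build supports.
function availableFetchMethods(): array
{
    $methods = [];
    if (extension_loaded('curl')) {
        $methods[] = 'curl';
    }
    // fopen()/file_get_contents() on URLs needs the allow_url_fopen ini setting.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $methods[] = 'stream';
    }
    return $methods;
}

// Hypothetical helper: fetch a URL with cURL if possible, else with the stream wrapper.
function fetchPage(string $url): string
{
    $methods = availableFetchMethods();

    if (in_array('curl', $methods, true)) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // arbitrary timeout in seconds
        $body = curl_exec($ch);
        if ($body === false) {
            throw new RuntimeException('cURL request failed: ' . curl_error($ch));
        }
        return $body;
    }

    if (in_array('stream', $methods, true)) {
        $body = @file_get_contents($url); // @ hides the warning; failure is reported below
        if ($body === false) {
            throw new RuntimeException("Could not read $url");
        }
        return $body;
    }

    throw new RuntimeException('Neither cURL nor allow_url_fopen is available');
}
```

Calling fetchPage("http://google.com/") inside a try/catch block would then return the page body or report why it could not be fetched.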

One way:

$url = "http://www.brothersoft.com/publisher/xtracomponents.html";
$page = file_get_contents($url);
$outfile = "xtracomponents.html";
file_put_contents($outfile, $page);

The code above is just an example and lacks any(!) error checking and handling.
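
For comparison, here is one possible way to add the missing checks. It is a sketch, not the only way to do it: savePage() is a hypothetical name, and throwing RuntimeException is just one of several reasonable ways to report the failure.

```php
<?php
// Hypothetical helper: read $url and write its contents to $outfile, failing loudly.
function savePage(string $url, string $outfile): int
{
    // file_get_contents() returns false on failure; @ hides the accompanying warning.
    $page = @file_get_contents($url);
    if ($page === false) {
        throw new RuntimeException("Could not read $url");
    }

    // file_put_contents() returns the number of bytes written, or false on failure.
    $bytes = @file_put_contents($outfile, $page);
    if ($bytes === false) {
        throw new RuntimeException("Could not write $outfile");
    }

    return $bytes;
}
```

Wrapped in try/catch, a call such as savePage($url, $outfile) with the variables from the example above either returns the byte count or raises an exception naming the step that failed.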

As the other answers have said, either the standard PHP stream functions or cURL is your best bet for retrieving the HTML. As for removing the tags, here are a couple of approaches:

Option #1: Use the Tidy extension, if available on your server, to walk through the document tree recursively and return the text from the nodes. Something like this:

function textFromHtml(TidyNode $node) {
    if ($node->isText()) {
        return $node->value;
    } else if ($node->hasChildren()) {
        $childText = '';
        foreach ($node->child as $child) {
            $childText .= textFromHtml($child);
        }
        return $childText;
    }
    return '';
}

You might want something more sophisticated than that, e.g., that replaces <br /> tags (where $node->name == 'br') with newlines, but this will do for a start.

Then, load the text of the HTML into a Tidy object and call your function on the body node. If you have the contents in a string, use:

$tidy = new tidy();
$tidy->parseString($contents);
$text = textFromHtml($tidy->body());
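
The <br /> handling suggested earlier could look roughly like the sketch below. It assumes the Tidy extension is loaded; textFromHtmlWithBreaks() is a hypothetical name, the sample markup is invented, and the exact whitespace Tidy leaves in text nodes may differ from what you expect.

```php
<?php
// Sketch: like textFromHtml(), but emits a newline for every <br> element.
// Hypothetical function name; requires the Tidy extension.
function textFromHtmlWithBreaks(tidyNode $node)
{
    if ($node->isText()) {
        return $node->value;
    }
    if ($node->name === 'br') {
        return "\n";
    }
    $text = '';
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $text .= textFromHtmlWithBreaks($child);
        }
    }
    return $text;
}

// Only run the demonstration where Tidy is actually installed.
if (extension_loaded('tidy')) {
    $contents = '<html><body><p>first line<br />second line</p></body></html>';
    $tidy = new tidy();
    $tidy->parseString($contents, [], 'utf8');
    echo textFromHtmlWithBreaks($tidy->body()), "\n";
}
```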

Option #2: Use regexes to strip everything between < and >. You could (and probably should) develop a more sophisticated regex that, for example, matches only valid HTML start or end tags. Any errors in the syntax of the page, like a stray angle bracket in body text, could mean garbage output if you aren't careful. This is why Tidy is so nice (it is specifically designed to clean up bad pages), but it might not be available.
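
A minimal sketch of the naive regex version, including the failure mode described above. The name stripTagsNaive() is made up for illustration; PHP's built-in strip_tags() function is a ready-made alternative for the same job.

```php
<?php
// Naive tag stripper (hypothetical name): deletes everything from a "<" to the next ">".
function stripTagsNaive(string $html): string
{
    $text = preg_replace('/<[^>]*>/', '', $html);
    // Turn entities such as &amp; back into plain characters.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Well-formed input: tags are removed and the entity is decoded.
echo stripTagsNaive('<p>Hello <b>world</b> &amp; friends</p>'), "\n";
// prints "Hello world & friends"

// A stray "<" in body text: everything up to the next ">" is swallowed as well.
echo stripTagsNaive('<p>a < b and c > d</p>'), "\n";
// prints "a  d" (the comparison "< b and c >" is gone)
```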
