How to scrape iframe content using cURL


--Edit--

You could load the page contents into a string, parse the string for the iframe's src attribute, then load the iframe source into another string.

$wrapperPage = file_get_contents('http://localhost/test/index.html');
// Capture the value of the iframe's src attribute (here, a .html file).
$pattern = '/<iframe[^>]*src="([^"]+\.html)"/i';
if (!preg_match($pattern, $wrapperPage, $matches)) {
    throw new Exception('No match found!');
}
// $matches[1] holds only the captured URL, so no quote or "src=" stripping is needed.
$src = trim($matches[1]);
$iframeContents = file_get_contents($src);
var_dump($iframeContents);
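The "parse the string for iframe" step can also be done with PHP's DOM extension instead of a regular expression, which tends to cope better with attribute order and quoting. This is a minimal sketch of that alternative, not the method the answer uses; `extractIframeSources()` is a hypothetical helper name, and it assumes the `dom` extension is enabled and the input string is not empty.

```php
<?php
// Hypothetical helper: collect the src attribute of every <iframe> in an HTML string.
function extractIframeSources(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $src = trim($iframe->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}
```

Each returned value can then be passed to file_get_contents() or cURL, as in the snippet above.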

--Original--

The URL you are passing to the cURL handle is that of the page wrapping the iframe. Try setting it to the URL of the iframe itself:

$url = "http://localhost/test/france.html";


Note that, for a variety of reasons, an iframe's URL sometimes can't be read outside the context of its own server: requesting the URL directly returns some kind of "can't be read directly or externally" error message.

In these cases, you can use curl_setopt($ch, CURLOPT_REFERER, $fullpageurl); (if you're in PHP and reading the response with curl_exec). The server then sees the request as coming from the original page, and you can read the source.

So if, for whatever reason, france.html couldn't be read outside the context of the larger page that included it as an iframe, you can still get its source with the methods above by using CURLOPT_REFERER and setting the main page (test/index.html in the original question) as the referrer.
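Put together, fetching the iframe with the wrapper page as the referrer might look like the sketch below. `buildCurlOptions()` and `fetchPage()` are hypothetical helper names introduced only for illustration, and the sketch assumes the `curl` extension is loaded.

```php
<?php
// Hypothetical helper: cURL options for fetching $url, optionally sending $referer
// as the Referer header.
function buildCurlOptions(string $url, ?string $referer = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
    ];
    if ($referer !== null) {
        $options[CURLOPT_REFERER] = $referer;
    }
    return $options;
}

// Hypothetical helper: fetch a page and return its body, or throw on failure.
function fetchPage(string $url, ?string $referer = null): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $referer));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}
```

With the URLs from the question, the call would be fetchPage('http://localhost/test/france.html', 'http://localhost/test/index.html'). Whether a given server actually checks the Referer header is an assumption; some use cookies or other checks instead.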


To answer your question, your pattern does not match the input text:

          <p>The Capitol of France is: Paris</p>

You have an extra space before the closing paragraph tag, which can never match:

preg_match("'The Capitol of France is:(.*?). </p>'si"

You should have the space before the capture group and remove the redundant . thereafter:

preg_match("'The Capitol of France is: (.*?)</p>'si"

To allow optional whitespace at either of the two positions, use \s* instead:

preg_match("'The Capitol of France is:\s*(.*?)\s*</p>'si"

You could also make the capture group only match letters with (\w+) to be more specific.
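As a quick check, the whitespace-tolerant pattern can be wrapped in a small function and run against the paragraph quoted above. This is only a sketch; `extractCapital()` is a hypothetical name, not something from the question.

```php
<?php
// Hypothetical helper: pull the capital out of a paragraph such as
// "<p>The Capitol of France is: Paris</p>". Returns null when nothing matches.
function extractCapital(string $html): ?string
{
    // \s* tolerates any amount of whitespace (including none) on either side of the value.
    $pattern = "'The Capitol of France is:\s*(.*?)\s*</p>'si";
    if (preg_match($pattern, $html, $match)) {
        return $match[1];
    }
    return null;
}
```

Swapping (.*?) for (\w+) would restrict the capture to a single word, which fails for multi-word values.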