Trying to use HTML DOM parser to get main image on Amazon page


Using the Amazon API might be the better solution, but this is not the question.

When I downloaded the HTML of the sample web page (the content without running JavaScript), I could not find any tag with id="landingImage"[1]. I did find an image tag with id="main-image", but extracting it with DOMDocument was not successful: the methods loadHTML() and loadHTMLFile() weren't able to parse the HTML.

But the interesting part can be extracted with a regular expression. The following code will give you the image source:

$url = 'http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20';
$html = file_get_contents($url);
$matches = array();
if (preg_match('#<img[^>]*id="main-image"[^>]*src="(.*?)"[^>]*>#', $html, $matches)) {
    $src = $matches[1];
}
// The source of the image is
// $src: 'http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg'

[1] The HTML source was downloaded within PHP using the function file_get_contents. Downloading the same page with Firefox results in different HTML. In the latter case you will find an image tag with the id attribute "landingImage", even though JavaScript is NOT enabled. It seems that the HTML served depends on the client, that is, on the headers in the HTTP request.
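Since the markup appears to depend on the request headers, one thing to try is sending browser-like headers from PHP. The following is only a sketch: the helper name browserLikeContext, the Accept-Language value, and the timeout are assumptions for illustration, and there is no guarantee that Amazon will return the same HTML as it does for Firefox.

```php
<?php
// Hypothetical helper (not part of any library): builds a stream context
// that makes file_get_contents() send browser-like request headers.
function browserLikeContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            // Sent as the User-Agent request header.
            'user_agent' => $userAgent,
            // Extra header; the value here is an assumed example.
            'header'     => "Accept-Language: en-US,en;q=0.8\r\n",
            // Give up after 10 seconds (assumed value).
            'timeout'    => 10,
        ],
    ]);
}
```

It would be used as file_get_contents($url, false, browserLikeContext($userAgent)), where $userAgent is whatever User-Agent string your browser sends. Comparing the result with the plain file_get_contents($url) output shows whether the headers make a difference.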


On the page from your example, the img tag with id="landingImage" does not contain a src attribute. That attribute is added by JavaScript.

However, the tag does contain a data-a-dynamic-image attribute with the value {"http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg":[200,200]}

You can read the value of this attribute and then parse it, either with a regular expression or with the strpos and substr functions.
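A minimal sketch of that idea follows. It uses regular expressions to pull out the attribute and then, instead of strpos and substr, decodes the value with json_decode, since the value looks like JSON whose keys are image URLs. The function name firstDynamicImageUrl is hypothetical, and the code assumes the attribute value is HTML-entity-encoded or single-quoted in the real page source.

```php
<?php
// Hypothetical helper: returns the first image URL listed in the
// data-a-dynamic-image attribute of <img id="landingImage">, or null.
function firstDynamicImageUrl(string $html): ?string
{
    // Locate the whole landingImage tag first.
    if (!preg_match('#<img[^>]*id="landingImage"[^>]*>#i', $html, $tag)) {
        return null;
    }
    // Then grab the attribute value, whichever quote style delimits it.
    if (!preg_match('#data-a-dynamic-image=(["\'])(.*?)\1#s', $tag[0], $attr)) {
        return null;
    }
    // Turn entities such as &quot; back into characters before decoding.
    $json = html_entity_decode($attr[2], ENT_QUOTES);
    $data = json_decode($json, true);
    if (!is_array($data) || $data === []) {
        return null;
    }
    // Keys are the image URLs; values are [width, height] pairs.
    return array_keys($data)[0];
}

// Usage with a made-up tag modeled on the value quoted above.
$sample = '<img id="landingImage" data-a-dynamic-image="'
        . '{&quot;http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg&quot;:[200,200]}">';
echo firstDynamicImageUrl($sample), "\n";
```

If the attribute lists several sizes, the values can be compared to pick the largest image rather than the first one.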


It looks like not every page uses the same HTML. You will need to check for several possibilities and log the cases where no image is found, so that you can add support for them. For instance:

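// Requires PHP Simple HTML DOM Parser; the include path is an assumption.
require_once 'simple_html_dom.php';

$url = 'http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20';
$html = file_get_html($url);
$image = null;
// file_get_html() returns false when the page could not be loaded.
if ($html) {
  $image = $html->find('img[id="landingImage"]', 0);
  if (!is_object($image)) {
    $image = $html->find('img[id="main-image"]', 0);
  }
}
if (!is_object($image)) {
  // Log the error to apache error log
  // (strings are concatenated with "." in PHP, not "+")
  error_log('Could not find amazon image: ' . $url);
} else {
  print $image->src;
}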