Trouble getting the name of a product from a webpage

php curl web-scraping simple-html-dom

@t.m.adam already solved the problem; I just want to add that there's no good reason to use simple_html_dom today. It seems unmaintained (development stopped in 2014) and there are lots of unresolved bug reports. Most importantly, DOMDocument and DOMXPath can do just about everything simple_html_dom can, are maintained, and are an integrated part of PHP, which means there's nothing to include or bundle with your script. Parsing the page with DOMDocument and DOMXPath would look like:

    $htmlContent = curl_exec($ch);
    curl_close($ch);
    fclose($cookieFileh); // thanks to tmpfile(), this also deletes the cookie file.
    // loadHTML() is an instance method; calling it statically fails on PHP 8.
    $dom = new DOMDocument();
    @$dom->loadHTML($htmlContent);
    $xp = new DOMXPath($dom);
    $itemTitle = $xp->query('//*[@id="bannerComponents-Container"]//*[@itemprop="name"]')->item(0)->textContent;
    echo $itemTitle;
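The snippet above dereferences `item(0)` directly and silences parser warnings with `@`. A slightly more defensive variant might look like the sketch below. The inline `$htmlContent` string and its product name are made-up stand-ins for the downloaded page, not the real markup; the XPath expression is the one from the answer.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$htmlContent = '<div id="bannerComponents-Container">'
             . '<h1><span itemprop="name">Example product</span></h1></div>';

// Collect libxml parse warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($htmlContent);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xp = new DOMXPath($dom);
$nodes = $xp->query('//*[@id="bannerComponents-Container"]//*[@itemprop="name"]');

// item(0) is null when nothing matched, so check before reading textContent.
$node = $nodes->item(0);
$itemTitle = $node !== null ? trim($node->textContent) : '';
echo $itemTitle, PHP_EOL;
```

If the site changes its markup, this version prints an empty line instead of failing with a fatal error on a null node.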


Your selector does work in a browser, but the element it targets is not present in the page source that curl receives.

Try saving the curled page from the terminal and you'll see that its structure differs from what you see in the browser.

This is true for most modern websites, because they rely heavily on JavaScript and curl does not run JavaScript for you.

I saved the curl results into a file; the brand info looks like this:

<a itemprop="brand" class="generic" data-tstid="Label_ItemBrand" href="/bd/shopping/men/gucci/items.aspx" dir="ltr">Gucci</a>
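One way to check which selectors survive in the curl output is to run them against the saved markup. The sketch below does that with DOMXPath; `xpath_matches` is a hypothetical helper name, and `$saved` holds only the fragment quoted above rather than the whole page.

```php
<?php
// Returns true when the XPath expression matches at least one node in $html.
// (Hypothetical helper, written for this illustration.)
function xpath_matches(string $html, string $query): bool
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($dom))->query($query);
    return $nodes !== false && $nodes->length > 0;
}

// Fragment copied from the saved curl output quoted above.
$saved = '<a itemprop="brand" class="generic" data-tstid="Label_ItemBrand" '
       . 'href="/bd/shopping/men/gucci/items.aspx" dir="ltr">Gucci</a>';

$brandFound  = xpath_matches($saved, '//a[@itemprop="brand"]');
$bannerFound = xpath_matches($saved, '//*[@id="bannerComponents-Container"]//*[@itemprop="name"]');

var_dump($brandFound);   // the brand link is in the fragment
var_dump($bannerFound);  // the banner container is not
```

Running the same check against the full saved file shows quickly whether a selector that works in the browser is usable with curl at all.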


The main difference between your successful Python script and your PHP script is the use of a session. Your PHP script doesn't use cookies, and that triggers a different response from the server.

We have two options:

Change the selector. As mentioned in Mark's answer, the item is still in the HTML, but in a different tag. We could get it with this selector:
```
'a[itemprop="brand"]'
```

Use cookies. We can get the same response as your Python script if we use CURLOPT_COOKIESESSION and a temporary file to write/read the cookies.

    function get_content($url) {
        $cookieFileh = tmpfile();
        $cookieFile = stream_get_meta_data($cookieFileh)['uri'];
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
        curl_setopt($ch, CURLOPT_COOKIESESSION, true);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_ENCODING, "gzip");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_exec($ch);
        $htmlContent = curl_exec($ch);
        curl_close($ch);
        fclose($cookieFileh); // thanks to tmpfile(), this also deletes the cookie file.
        $dom = new simple_html_dom();
        $dom->load($htmlContent);
        $itemTitle = $dom->find('#bannerComponents-Container [itemprop="name"]', 0)->plaintext;
        echo "{$itemTitle}";
    }

    $link = "https://www.farfetch.com/bd/shopping/men/gucci-rhyton-web-print-leather-sneaker-item-12964878.aspx";
    get_content($link); //Gucci

This script performs two requests; the first request writes the cookies to file, the second reads and uses them.

In this case the server returns a compressed response, so I've used CURLOPT_ENCODING to unzip the contents.

Since you use headers only to set a user-agent, it's best to use the CURLOPT_USERAGENT option.

I've set CURLOPT_SSL_VERIFYPEER to false because I haven't set a certificate, and cURL fails to use HTTPS without one. If you can communicate with HTTPS sites, it's best not to use this option, for security reasons. If not, you could set a certificate with CURLOPT_CAINFO.
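As a rough sketch of the CURLOPT_CAINFO route, the options could be collected like this. `verified_https_options` is a hypothetical helper name and `/path/to/cacert.pem` is a placeholder; point it at a CA bundle that actually exists on your machine.

```php
<?php
// Builds cURL options that keep certificate verification switched on.
// $caBundle is a placeholder path to a PEM file of trusted CA certificates.
function verified_https_options(string $url, string $caBundle): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => true,  // verify the server certificate
        CURLOPT_SSL_VERIFYHOST => 2,     // and that it matches the host name
        CURLOPT_CAINFO         => $caBundle,
    ];
}

$options = verified_https_options('https://www.farfetch.com/', '/path/to/cacert.pem');

// Apply the options to a handle; curl_exec($ch) would then be called as before.
$ch = curl_init();
curl_setopt_array($ch, $options);
```

The cookie and user-agent options from the function above can be added to the same array, so the only change is replacing the `CURLOPT_SSL_VERIFYPEER, false` line.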
