
Scrape dynamically loaded website


When I load the page "http://proxydb.net" using cURL, or try to scrape the page, then the response body is empty - that is because this particular website uses a user-agent whitelist: if your user-agent is not on the whitelist, you just get served a blank page. Presumably all major web browsers are whitelisted (Chrome, Internet Explorer, Edge, Safari, Opera, etc.), but here's a specific user-agent that is whitelisted:

Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36 

(the user-agent of Chrome 65 running on Windows 7 x64), and thus, this works:

curl 'http://proxydb.net/' -H 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
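The same request can be made from Python. This is only a minimal sketch using the standard library: the helper names `build_request` and `fetch` are made up for illustration, and it assumes the site still whitelists the user-agent quoted above.

```python
import urllib.request

# The whitelisted user-agent quoted above (Chrome 65 on Windows 7 x64).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
)


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that carries the whitelisted User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url: str) -> str:
    """Download the page and return its body as text (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Usage (not executed here because it needs network access):
# html = fetch("http://proxydb.net/")
```

Without the header, urllib sends its own default user-agent (Python-urllib/x.y), which would presumably get the same blank page as plain cURL.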

As for how content gets loaded dynamically in general: that's usually done with XMLHttpRequest or, in older code, iframes.

Apparently, the page is dynamically loaded using JavaScript. - Wrong: these guys are not loading the proxy list dynamically. The proxies are embedded directly in the front page (as long as you're using a whitelisted user-agent), obscured like this:

var q = '42.86.831'.split('').reverse().join('');
var yy = /* */ atob('\x4d\x43\x34\x79\x4d\x54\x67\x3d'.replace(/\\x([0-9A-Fa-f]{2})/g, function() {
    return String.fromCharCode(parseInt(arguments[1], 16))
}));
var pp = (3109 - ([] + [])) /**/ + (+document.querySelector('[data-numr]').getAttribute('data-numr')) - [] + [];
document.write('<a href="/' + q + yy + '/' + pp + '#http">' + q + yy + String.fromCharCode(58) + pp + '</a>');

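(which, together with the data-numr div, in this case translates to 138.68.240.218:3128. The first part of the IP is a reversed string and the second part is base64-encoded. The port is actually encrypted: the decryption key is in a div looking like <div style="display:none" data-numr="19"></div>, and it is added to the number in the script. Here the key was 19, so the port is 3109 + 19 = 3128.)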
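The decoding steps described above can be sketched in Python. This is a rough sketch, not a robust parser: the function names and the regular expressions are my own, and they assume every entry on the page has exactly the same shape as the sample snippet (a reversed string, an atob() call with \x escapes, and a "var pp = (NUMBER" constant). If the site changes its obfuscation, the patterns will need adjusting.

```python
import base64
import re

# The sample snippet and hidden div quoted above, kept as raw strings so the
# \xNN sequences stay literal, exactly as they appear in the page source.
SAMPLE_SCRIPT = (
    r"var q = '42.86.831'.split('').reverse().join('');"
    r"var yy = /* */ atob('\x4d\x43\x34\x79\x4d\x54\x67\x3d'"
    r".replace(/\\x([0-9A-Fa-f]{2})/g, function() {"
    r"    return String.fromCharCode(parseInt(arguments[1], 16))}));"
    r"var pp = (3109 - ([] + [])) /**/ + "
    r"(+document.querySelector('[data-numr]').getAttribute('data-numr')) - [] + [];"
)
SAMPLE_DIV = '<div style="display:none" data-numr="19"></div>'


def extract_numr(html: str) -> int:
    """Read the key from the hidden div's data-numr attribute."""
    return int(re.search(r'data-numr="(\d+)"', html).group(1))


def decode_proxy(script: str, numr: int) -> str:
    """Recover "ip:port" from one obfuscated script snippet."""
    # First part of the IP: a quoted string that the page reverses.
    reversed_part = re.search(
        r"'([^']*)'\.split\(''\)\.reverse\(\)", script
    ).group(1)
    first = reversed_part[::-1]

    # Second part: base64 text spelled out as \xNN escapes inside atob('...').
    escaped = re.search(r"atob\('([^']*)'", script).group(1)
    b64 = re.sub(
        r"\\x([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), escaped
    )
    second = base64.b64decode(b64).decode("ascii")

    # Port: the constant in the script plus the key from data-numr.
    base_port = int(re.search(r"var pp = \((\d+)", script).group(1))
    return f"{first}{second}:{base_port + numr}"


print(decode_proxy(SAMPLE_SCRIPT, extract_numr(SAMPLE_DIV)))
# prints 138.68.240.218:3128
```

To scrape the whole list you would fetch the front page with a whitelisted user-agent, pull out each such script snippet plus the data-numr value, and run them through decode_proxy.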