Using the inspect element feature in google chrome to scrape web sites [closed]


You can use Greasemonkey or Tampermonkey to do this quite easily. You simply define the URL(s) in your userscript, then navigate to a matching page to invoke it. You can use a top page containing an iframe that navigates to each target page on a schedule. When a page shows in the iframe, the userscript runs and your data is saved.

The scripting is basic JavaScript, nothing fancy; let me know if you need a starter. The biggest catch is downloading the file, a fairly new capability for JS, but it's simple to do with a download library, like mine (shameless plug).

So, basically, you can have a textarea with a list of URLs, one per line; grab a line and set the iframe's .src to the URL, which invokes the userscript. You can drill down into the page with CSS query selectors, or save the whole page: just grab the .outerHTML of the tag whose code you need. I'll be happy to illustrate if need be, but once you get it working, you'll never go back to server-to-server scraping again.
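For instance, a rough sketch of that textarea-driven dispatcher might look like the following (the urlList/frame1 IDs, the start button, and the 30-second interval are placeholders of mine, not part of the code shown later):

    <html>
    <textarea id=urlList rows=10 cols=60></textarea>
    <button onclick="start()">start</button>
    <iframe id=frame1></iframe>
    <script>
        var urls = [], slot = 0;
        function start(){
            // read the textarea: one url per line, skipping blank lines
            urls = urlList.value.split("\n").filter(Boolean);
            setInterval(function(){
                // each navigation fires whatever userscript matches that url
                frame1.src = urls[slot++ % urls.length];
            }, 1000 * 30); // 30 sec per page, adjust as needed
        }
    </script>
    </html>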

EDIT:

A framing dispatcher page that simply loads each needed page into an iframe, triggering the userscript:

    <html>
    <iframe id=frame1></iframe>
    <script>
        var base = "http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto,%20ON&start="; // the part of the url that stays the same
        var pages = [20, 40, 60, 80]; // all the differing url parts to be concat'd at the end
        var delay = 1000 * 30; // 30 sec delay, adjust if needed
        var slot = 0; // current shown page's index in pages

        function doNext(){
            var page = pages[slot++];
            if(page === undefined){ // ran past the last entry: wrap around to the first
                slot = 1;
                page = pages[0];
            }
            frame1.src = base + page; // navigating the iframe triggers the userscript
        }

        setInterval(doNext, delay);
    </script>
    </html>
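A quick note on the dispatcher: when slot runs past the end of pages, doNext() wraps back to the first entry, so the four result pages cycle indefinitely, and every iframe navigation fires the userscript below through its @match rule.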

EDIT2: the userscript code:

    // ==UserScript==
    // @name         yelp scraper
    // @namespace    http://anon.org
    // @version      0.1
    // @description  grab listing from yelp
    // @match        http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto,%20ON&start=*
    // @copyright    2013, dandavis
    // ==/UserScript==

    // tiny query helper: returns an array of elements matching a CSS selector
    function Q(a, b){
        var t = "querySelectorAll";
        b = b || document.documentElement;
        if(!b[t]){ return; }
        if(b.split){ b = Q(b)[0]; } // a string was passed as the context: resolve it first
        return [].slice.call(b[t](a)) || [];
    }

    // saves a string as a local file by clicking a synthetic download link
    function download(strData, strFileName, strMimeType){
        var D = document,
            a = D.createElement("a"),
            t = strMimeType || "text/plain";
        a.href = "data:" + t + "," + escape(strData);
        if("download" in a){ // browsers that support the download attribute
            a.setAttribute("download", strFileName);
            a.innerHTML = "downloading...";
            D.body.appendChild(a);
            setTimeout(function(){
                var e = D.createEvent("MouseEvents");
                e.initMouseEvent("click", true, false, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
                a.dispatchEvent(e);
                D.body.removeChild(a);
            }, 66);
            return true;
        }
        // fallback: load the data into a temporary hidden iframe
        var f = D.createElement("iframe");
        D.body.appendChild(f);
        f.src = "data:" + (strMimeType || "application/octet-stream") +
                (window.btoa ? ";base64" : "") + "," +
                (window.btoa ? window.btoa(strData) : escape(strData));
        setTimeout(function(){ D.body.removeChild(f); }, 333);
        return true;
    }

    // once the yelp page loads, grab the listing container and save it
    window.addEventListener("load", function(){
        var code = Q("#businessresults")[0].outerHTML;
        download(code, "yelp_page_" + location.href.split("start=")[1].split("&")[0] + ".txt", "x-application/nothing");
    });

Note that it saves the HTML as .txt to avoid a Chrome warning about potentially harmful files. You can rename the files in bulk, or make up a new extension and associate it with the browser.

EDIT: I forgot to mention that you should turn off Chrome's file-saving confirmation for unattended use: Settings \ Show advanced settings... \ Ask where to save each file before downloading (uncheck it).


I would check out Selenium to automate browser functions. You can automate a search for an element by id/name and then check whether it exists, or parse through the HTML however you like, all in an automated fashion.
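As a minimal sketch using the selenium-webdriver package for Node (the Yelp URL and the #businessresults id are carried over from the answer above; the 10-second timeout is just an assumption):

    // npm install selenium-webdriver (plus a chromedriver on your PATH)
    const { Builder, By, until } = require('selenium-webdriver');

    (async function scrape(){
        const driver = await new Builder().forBrowser('chrome').build();
        try {
            await driver.get('http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto,%20ON&start=20');
            // wait until the listing container exists, then grab its html
            const listing = await driver.wait(until.elementLocated(By.id('businessresults')), 10000);
            const html = await listing.getAttribute('outerHTML');
            console.log(html);
        } finally {
            await driver.quit();
        }
    })();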