Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)


Try XHR2 with responseType = "document" first, and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (by synchronously checking response === null on an object URL created from a text/html blob).
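A minimal sketch of that request-then-fall-back idea, not the code from the gist. The names fetchDocument and mimeFromHeader are hypothetical, and the sketch assumes a DOMParser that accepts text/html (natively or through a patch). Instead of the blob-based detection, it relies on the simpler assumption that an engine without "document" support ignores the assignment and leaves responseType unchanged.

```javascript
// Hypothetical helper: DOMParser takes a bare MIME type, so drop
// parameters such as "; charset=utf-8" from the Content-Type header.
function mimeFromHeader(contentType) {
  return (contentType || "text/html").split(";")[0].trim().toLowerCase();
}

// Hypothetical helper: load `url` and pass a parsed Document (or null
// on a network error) to `callback`.
function fetchDocument(url, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", url, true);
  try {
    xhr.responseType = "document";
  } catch (e) {
    // Assumption: some old engines throw instead of ignoring the value.
  }
  // If the value did not stick, the engine does not support it.
  var nativeDocument = xhr.responseType === "document";

  xhr.onload = function () {
    if (nativeDocument) {
      // responseText is not readable in this mode; use response directly.
      callback(xhr.response);
      return;
    }
    // Fallback: parse the text ourselves. Assumes the server reports a
    // MIME type that DOMParser supports (e.g. text/html).
    var mime = mimeFromHeader(xhr.getResponseHeader("Content-Type"));
    callback(new DOMParser().parseFromString(xhr.responseText, mime));
  };
  xhr.onerror = function () {
    callback(null);
  };
  xhr.send();
}
```

In an extension page this could be called as fetchDocument(someUrl, function (doc) { ... }), after which doc.querySelectorAll works as on any document. Cross-origin URLs still need matching host permissions in the manifest.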

Use the chrome.webRequest API to hide headers such as X-Requested-With.
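One way this could look in a background script is sketched below. It assumes a Manifest V2 extension with the "webRequest" and "webRequestBlocking" permissions plus host permissions for the target sites; under Manifest V3, blocking listeners are generally unavailable and declarativeNetRequest would be the place to look. HIDDEN_HEADERS and stripHeaders are hypothetical names.

```javascript
// Header names to remove, in lower case (illustrative list).
var HIDDEN_HEADERS = ["x-requested-with"];

// Hypothetical helper: return a copy of the header list without the
// hidden names. Header names are compared case-insensitively.
function stripHeaders(requestHeaders) {
  return requestHeaders.filter(function (header) {
    return HIDDEN_HEADERS.indexOf(header.name.toLowerCase()) === -1;
  });
}

// Register the listener only where the extension API exists.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
      // Returning requestHeaders replaces the outgoing header set.
      return { requestHeaders: stripHeaders(details.requestHeaders || []) };
    },
    { urls: ["<all_urls>"] },          // narrow this to the sites you scrape
    ["blocking", "requestHeaders"]     // needed to read and modify headers
  );
}
```

Keeping the filtering in a plain function makes it easy to extend the list or to check the logic outside the browser.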


If you are open to something other than a Google Chrome extension, look at PhantomJS, which uses QtWebKit in the background and behaves just like a browser, including making Ajax requests. It is a headless browser: it doesn't display its output on screen and can work quietly in the background while you do other things. If you want, you can export the pages it fetches as images or PDFs. It provides a JS interface for loading pages, clicking buttons, and so on, much like you would in a browser. You can also inject custom JS, for example jQuery, into any page you want to scrape and use it to access the DOM and export the desired data. Because it uses WebKit, its rendering behaviour is very close to Google Chrome's.
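A small PhantomJS script along those lines might look like the sketch below. The URL, the output file name, and the collectHeadings function are placeholders chosen for illustration. The script is meant to be run with the phantomjs binary, not in Chrome or Node.

```javascript
// Hypothetical extractor. It is executed inside the loaded page by
// page.evaluate, so it may only use what exists in that page.
function collectHeadings() {
  var nodes = document.querySelectorAll("h1, h2");
  return Array.prototype.map.call(nodes, function (node) {
    return node.textContent.trim();
  });
}

// Only run the scraping part under PhantomJS itself.
if (typeof phantom !== "undefined") {
  var page = require("webpage").create();

  page.open("http://example.com/", function (status) {
    if (status !== "success") {
      console.log("Failed to load page");
      phantom.exit(1);
      return;
    }
    // Export a screenshot of the rendered page (placeholder file name).
    page.render("example.png");
    // Run the extractor in the page and print the result as JSON.
    console.log(JSON.stringify(page.evaluate(collectHeadings)));
    phantom.exit(0);
  });
}
```

To use jQuery inside the page, page.injectJs (local file) or page.includeJs (remote URL) can load it before calling page.evaluate.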

Another option is Aptana Jaxer, which is based on the Mozilla engine and is a very good concept in itself. It can be used as a simple scraping tool as well.


A lot of tools have been released since this question was asked.

artoo.js is one of them. It's a piece of JavaScript code meant to be run in your browser's console to provide you with some scraping utilities. It can also be used as a Chrome extension.
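As a rough illustration of what a console session could look like, the sketch below uses artoo.scrape with a selector and a data model, following the pattern shown in the artoo.js documentation as I recall it; check the current docs before relying on it. The linkModel and scrapeLinks names and the "li" selector are hypothetical.

```javascript
// Data model: for each matched element, take the text of its first
// link as `title` and that link's href attribute as `url`.
var linkModel = {
  title: { sel: "a" },
  url: { sel: "a", attr: "href" }
};

// Hypothetical wrapper: returns null when artoo has not been loaded
// into the page, otherwise an array of { title, url } objects.
function scrapeLinks(rowSelector) {
  if (typeof artoo === "undefined") {
    return null;
  }
  return artoo.scrape(rowSelector, linkModel);
}
```

In the console, something like scrapeLinks("li") would then return one object per list item on the page.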