How to manage a 'pool' of PhantomJS instances

node.js web-scraping phantomjs jsdom

I set up a PhantomJs Cloud Service, and it pretty much does what you are asking. It took me about 5 weeks of work to implement.

The biggest problem you'll run into is the known issue of memory leaks in PhantomJs. The way I worked around this is to cycle my instances (kill each one and start a fresh one) every 50 calls.
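One way to picture that cycling is a small wrapper that counts calls and swaps the instance out once a limit is reached. This is only a sketch under assumptions: `createInstance` and `destroyInstance` are hypothetical callbacks standing in for however you spawn and kill a PhantomJS process, and the limit is whatever works for you (50 is simply the figure mentioned above).

```javascript
// Sketch of "cycle the instance every N calls".
// createInstance and destroyInstance are hypothetical hooks supplied by the
// caller; they are not part of PhantomJS or of any npm package.
function createRecycler(createInstance, destroyInstance, maxCalls) {
  var instance = null;
  var calls = 0;

  return {
    // Returns an instance to use for one call, replacing it when worn out.
    acquire: function () {
      if (instance !== null && calls >= maxCalls) {
        destroyInstance(instance); // old instance has served maxCalls calls
        instance = null;
      }
      if (instance === null) {
        instance = createInstance();
        calls = 0;
      }
      calls += 1;
      return instance;
    }
  };
}
```

The sketch assumes each instance handles one call at a time, so the old instance is idle when it gets destroyed. If an instance can serve overlapping calls, you would need to wait for them to finish before killing it.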

The second biggest problem you'll run into is that per-page processing is very CPU- and memory-intensive, so you'll only be able to run 4 or so instances per CPU.
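As a rough starting point, that rule of thumb can be turned into a pool size. The helper below is a hypothetical sketch; the factor of 4 is only the estimate given above, and the right number depends on the pages you render.

```javascript
var os = require('os');

// Rough pool size: instancesPerCpu workers for each logical CPU.
// cpuCount is optional and defaults to what Node reports for this machine.
function poolSize(instancesPerCpu, cpuCount) {
  var cpus = cpuCount || os.cpus().length;
  // Never return less than one worker, even on odd environments.
  return Math.max(1, Math.floor(cpus * instancesPerCpu));
}
```

For example, `poolSize(4, 2)` gives 8 for a two-CPU machine. Memory may turn out to be the tighter limit, so treat the result as an upper bound to tune downwards.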

The third biggest problem you'll run into is that PhantomJs is pretty wacky with page-finish events and redirects. You'll be informed that your page is finished rendering before it actually is. There are a number of ways to deal with this, but nothing 'standard' unfortunately.
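One common workaround (not necessarily the one I use) is to treat the page as finished only once no network requests have been in flight for a short quiet period. The sketch below assumes that approach. `createIdleTracker` and `attachToPage` are hypothetical names; the wiring relies on the `onResourceRequested`, `onResourceReceived` and `onResourceError` callbacks of a PhantomJS page object, and the length of the quiet period is an arbitrary choice.

```javascript
// Tracks in-flight requests by id and reports when the page has been quiet.
// "now" is injectable so the logic can be exercised without real timers.
function createIdleTracker(quietMs, now) {
  now = now || Date.now;
  var pending = {};          // request id -> true while in flight
  var lastActivity = now();

  function touch() { lastActivity = now(); }

  return {
    started: function (id) { pending[id] = true; touch(); },
    // Deleting by id makes a repeated "finished" for the same request harmless.
    finished: function (id) { delete pending[id]; touch(); },
    isIdle: function () {
      return Object.keys(pending).length === 0 &&
             now() - lastActivity >= quietMs;
    }
  };
}

// Wires the tracker to a PhantomJS page object's resource callbacks.
function attachToPage(page, tracker) {
  page.onResourceRequested = function (requestData) {
    tracker.started(requestData.id);
  };
  page.onResourceReceived = function (response) {
    // Each response is reported in stages; only 'end' means it is complete.
    if (response.stage === 'end') { tracker.finished(response.id); }
  };
  page.onResourceError = function (resourceError) {
    tracker.finished(resourceError.id);
  };
}
```

Inside the PhantomJS script you would call `attachToPage` before `page.open`, then poll `tracker.isIdle()` on a timer and render only once it returns true. Add an overall timeout as well, since some pages never stop making requests.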

The fourth biggest problem you'll have to deal with is interop between nodejs and phantomjs. Thankfully, there are a lot of npm packages to choose from that deal with this issue.

So I know I'm biased (as I wrote the solution I'm going to suggest) but I suggest you check out PhantomJsCloud.com which is free for light usage.

Jan 2015 update: Another (5th?) big problem I ran into is how to send the request/response from the manager/load-balancer to the PhantomJS instances. Originally I was using PhantomJS's built-in HTTP server, but I kept running into its limitations, especially regarding maximum response size. I ended up writing the request/response to the local file system as the line of communication.

* Total time spent on implementation of the service is perhaps 20 man-weeks, or perhaps 1000 hours of work.
* FYI, I am doing a complete rewrite for the next version (in progress).
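To illustrate the file-based exchange from the update, here is a minimal sketch written against Node's `fs` module. The file layout and all function names are assumptions made up for the example, not the layout my service uses. Inside PhantomJS the worker half would have to use PhantomJS's own `fs` module, whose API differs from Node's.

```javascript
var fs = require('fs');
var path = require('path');

// Write to a temporary name, then rename into place, so that a reader
// polling for the file should not see a half-written payload.
function writeJsonAtomic(filePath, value) {
  var tmpPath = filePath + '.tmp';
  fs.writeFileSync(tmpPath, JSON.stringify(value));
  fs.renameSync(tmpPath, filePath);
}

// Returns the parsed file, or null if it has not been written yet.
function readJsonIfPresent(filePath) {
  if (!fs.existsSync(filePath)) { return null; }
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}

// Hypothetical layout: <dir>/<id>.request.json and <dir>/<id>.response.json
function requestPath(dir, id) { return path.join(dir, id + '.request.json'); }
function responsePath(dir, id) { return path.join(dir, id + '.response.json'); }

// Manager side: hand a request over to a worker.
function submitRequest(dir, id, request) {
  writeJsonAtomic(requestPath(dir, id), request);
}

// Worker side: process the request if it exists; returns false otherwise.
function handleRequest(dir, id, handler) {
  var request = readJsonIfPresent(requestPath(dir, id));
  if (request === null) { return false; }
  writeJsonAtomic(responsePath(dir, id), handler(request));
  return true;
}

// Manager side: returns the response, or null if it is not ready yet.
function collectResponse(dir, id) {
  return readJsonIfPresent(responsePath(dir, id));
}
```

Because the payload lives in a file, its size is limited by disk space rather than by an HTTP server's response limit. The price is that the manager has to poll (or watch the directory) and clean up old files itself.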


The async JavaScript library works in Node and has a queue function that is quite handy for this kind of thing:

queue(worker, concurrency)
Creates a queue object with the specified concurrency. Tasks added to the queue will be processed in parallel (up to the concurrency limit). If all workers are in progress, the task is queued until one is available. Once a worker has completed a task, the task's callback is called.

Some pseudocode:

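var async = require('async');

function getSourceViaPhantomJs(url, callback) {
  // someMagicPhantomJsStuff is a placeholder for your actual PhantomJS call
  var resultingHtml = someMagicPhantomJsStuff(url);
  callback(null, resultingHtml);
}

var q = async.queue(function (task, callback) {
  // delegate to a function that should call callback when it's done
  // with (err, resultingHtml) as parameters
  getSourceViaPhantomJs(task.url, callback);
}, 5); // up to 5 PhantomJS calls at a time

// app is assumed to be an existing Express app
app.get('/some/url', function (req, res) {
  // the URL to scrape is read from the query string here,
  // e.g. /some/url?url_to_scrape=...
  q.push({url: req.query.url_to_scrape}, function (err, results) {
    if (err) {
      res.statusCode = 500;
      return res.end(String(err));
    }
    res.end(results);
  });
});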

Check out the entire documentation for queue at the project's readme.


For my master's thesis, I developed the library phantomjs-pool, which does exactly this. It lets you provide jobs, which are then mapped to PhantomJS workers. The library handles job distribution, communication, error handling, logging, restarting and a few other things. The library was successfully used to crawl more than one million pages.

Example:

The following code executes a Google search for the numbers 0 to 9 and saves a screenshot of the page as googleX.png. Four websites are crawled in parallel (due to the creation of four workers). The script is started via node master.js.

master.js (runs in the Node.js environment)

var Pool = require('phantomjs-pool').Pool;

var pool = new Pool({ // create a pool
    numWorkers : 4,   // with 4 workers
    jobCallback : jobCallback,
    workerFile : __dirname + '/worker.js', // location of the worker file
    phantomjsBinary : __dirname + '/path/to/phantomjs_binary' // either provide the location of the binary or install phantomjs or phantomjs2 (via npm)
});
pool.start();

function jobCallback(job, worker, index) { // called to create a single job
    if (index < 10) { // index is counted up for each job automatically
        job(index, function(err) { // create the job with index as data
            console.log('DONE: ' + index); // log that the job was done
        });
    } else {
        job(null); // no more jobs
    }
}

worker.js (runs in the PhantomJS environment)

var webpage = require('webpage');

module.exports = function(data, done, worker) { // data provided by the master
    var page = webpage.create();

    // search for the given data (which contains the index number) and save a screenshot
    page.open('https://www.google.com/search?q=' + data, function() {
        page.render('google' + data + '.png');
        done(); // signal that the job was executed
    });
};
