Node.js scraping with chrome-remote-interface


I know this was asked two years ago, but let me write it here for documentation purposes.

-- Tools of the trade --
I tried the same technique as you did (using the remote debugger for scraping), but instead of Python I used Node.js because of its asynchronous nature, which makes it easier to work with the websockets that the remote debugger relies on.
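To make the setup concrete, here is a minimal sketch of opening and closing a debugger session. It assumes Chrome was started with --remote-debugging-port=9222 and that `connect` is the function exported by the chrome-remote-interface package; `withClient` itself is a hypothetical helper name, not part of the library.

```javascript
// Hypothetical helper: opens a session, runs `task` with the connected
// client, and closes the session afterwards.
// `connect` is assumed to behave like the chrome-remote-interface export:
// called with an options object, it returns a promise for a client.
async function withClient(connect, task, options = { port: 9222 }) {
    const client = await connect(options);
    try {
        // `task` receives the connected client and may return any value.
        return await task(client);
    } finally {
        // Always release the websocket, even if the task throws.
        await client.close();
    }
}

// Usage against a real browser (not executed here):
//   const CDP = require('chrome-remote-interface');
//   withClient(CDP, async (client) => {
//       /* send commands through client.Page, client.Runtime, ... */
//   });
```

Passing `connect` in as a parameter keeps the helper independent of how the session is created, which also makes it easy to try out with a stub.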

-- Runtime.evaluate --
One thing I noticed is that Runtime.evaluate isn't a valid option for recovering data if your expression involves asynchronous calls, because it returns the result of the calling function and not of the callback function. You have to stick with synchronous expressions.
Example:

    Array.from(document.getElementsByTagName('tr'))
        .map((e) => e.children[2].innerHTML)
        .filter((e) => e.length > 0)
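If the asynchronous work can be expressed as a promise rather than a callback, the protocol's Runtime.evaluate also accepts an awaitPromise flag, which may lift this restriction. The sketch below assumes your Chrome build supports that flag and that the promise resolves to a primitive (for example a string); `evaluateAsync` is a hypothetical helper name.

```javascript
// Hypothetical helper: evaluates an expression that yields a promise and
// resolves with the value the promise settles to.
// `client` is assumed to be a connected chrome-remote-interface client.
async function evaluateAsync(client, expression) {
    const response = await client.Runtime.evaluate({
        expression,
        awaitPromise: true // wait for the promise produced by the expression
    });
    if (response.exceptionDetails) {
        // The expression threw, or the promise was rejected.
        throw new Error('Evaluation failed: ' + response.exceptionDetails.text);
    }
    return response.result.value;
}

// Usage (not executed here):
//   const title = await evaluateAsync(client,
//       'new Promise((resolve) => setTimeout(() => resolve(document.title), 500))');
```

Plain callbacks still do not work this way; the expression itself has to evaluate to the promise you want to wait for.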

Another thing is that when your expression returns an array, Runtime.evaluate only mentions that the expression returned an array, but not the array itself! (Infuriating, I know.) I got around it by simply encoding the arrays as JSON strings in the page context, then decoding them back into objects when they arrive in Node.js. For example, the above expression would need to be:

    JSON.stringify(
        Array.from(document.getElementsByTagName('tr'))
            .map((e) => e.children[2].innerHTML)
            .filter((e) => e.length > 0)
    )
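The JSON round trip can be wrapped in a small helper so that every call site does not have to repeat it. This is only a sketch: `evaluateJson` is a hypothetical name, `client` is assumed to be a connected chrome-remote-interface client, and the expression is assumed to be a single JavaScript expression (no trailing semicolon or statements).

```javascript
// Hypothetical helper: wraps `expression` in JSON.stringify inside the page,
// then decodes the resulting string on the Node.js side.
async function evaluateJson(client, expression) {
    const response = await client.Runtime.evaluate({
        expression: `JSON.stringify(${expression})`
    });
    if (response.exceptionDetails) {
        // The expression threw inside the page.
        throw new Error('Evaluation failed: ' + response.exceptionDetails.text);
    }
    const encoded = response.result.value;
    // JSON.stringify(undefined) yields undefined, which JSON.parse rejects.
    return encoded === undefined ? undefined : JSON.parse(encoded);
}

// Usage (not executed here):
//   const cells = await evaluateJson(client,
//       "Array.from(document.getElementsByTagName('tr'))" +
//       ".map((e) => e.children[2].innerHTML)" +
//       ".filter((e) => e.length > 0)");
```

Only JSON-serialisable data survives the trip, so DOM nodes have to be reduced to strings or plain objects inside the page first.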

-- Navigation --
When you trigger a page load by using "Page.navigate", ".click()", ".submit()", "window.location.href=..." or any other way, it's important to know when the next page has completely loaded before sending more instructions with Runtime.evaluate. I did the trick by asking the debugger to send me the page loading events (look for the Page.enable method in the documentation), then waiting for the "Page.loadEventFired" event before sending more expressions.
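The waiting step might look like the sketch below. It assumes a connected chrome-remote-interface client and relies on the library's convention that calling an event method such as Page.loadEventFired without a callback returns a promise for the next occurrence of that event; `navigateAndWait` is a hypothetical helper name.

```javascript
// Hypothetical helper: navigates to `url` and resolves only after the
// page's load event has been reported by the debugger.
async function navigateAndWait(client, url) {
    const { Page } = client;
    // Ask the debugger to start sending Page domain events.
    await Page.enable();
    // Subscribe before navigating so a fast load event is not missed.
    const loaded = Page.loadEventFired();
    await Page.navigate({ url });
    await loaded;
}

// Usage (not executed here):
//   await navigateAndWait(client, 'https://example.com/');
//   // It should now be safe to send Runtime.evaluate expressions.
```

For loads triggered by ".click()" or ".submit()", the same ordering applies: create the loadEventFired promise first, then send the expression that triggers the load, then await the promise.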


JavaScript expressions evaluated by Runtime.evaluate are executed within the page context just like what happens in the DevTools console.

You can interact with the DOM using the DOM domain, e.g., DOM.getDocument, DOM.querySelector, etc.
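As a rough illustration of the DOM domain, the sketch below looks up the first node matching a CSS selector and returns its markup. It assumes a connected chrome-remote-interface client; `outerHtmlOf` is a hypothetical helper name, and treating a node id of 0 as "no match" is an assumption about how the protocol reports a failed query.

```javascript
// Hypothetical helper: returns the outer HTML of the first node matching
// `selector`, or null when nothing matches.
async function outerHtmlOf(client, selector) {
    const { DOM } = client;
    // The document root is the starting point for selector queries.
    const { root } = await DOM.getDocument();
    const { nodeId } = await DOM.querySelector({ nodeId: root.nodeId, selector });
    // Assumption: a falsy node id (0) means the selector matched nothing.
    if (!nodeId) return null;
    const { outerHTML } = await DOM.getOuterHTML({ nodeId });
    return outerHTML;
}

// Usage (not executed here):
//   const html = await outerHtmlOf(client, 'table tr');
```

Compared with Runtime.evaluate, this keeps the query on the protocol side, at the cost of one round trip per step.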

Also remember that chrome-remote-interface is mainly a library, meaning that it allows you to write your own Node.js applications; chrome-remote-interface inspect is just a utility.

There are several places where you can get help:

If you ask something more specific I'd be happy to try to help you with that.

Finally you may want to take a look at automated-chrome-profiling, which I think is structurally similar to what you're trying to achieve.