Node.js scraping with chrome-remote-interface

python node.js google-chrome selenium screen-scraping

I know this was asked two years ago, but let me write it down here for documentation purposes.

-- Tools of the trade --
I tried the same technique as you did (using the remote debugger for scraping), but instead of Python I used Node.js because of its asynchronous nature, which makes it easier to work with the websockets that the remote debugger relies on.

-- Runtime.evaluate --
One thing I noticed is that a plain Runtime.evaluate call isn't a valid option for recovering data if your expression involves asynchronous calls, because it returns the result of the calling function and not of the callback function. You have to stick with synchronous expressions.
Example:

Array.from(document.getElementsByTagName('tr'))
    .map((e) => e.children[2].innerHTML)
    .filter((e) => e.length > 0)

Another thing is that when your expression returns an array, Runtime.evaluate by default only reports that the expression returned an array, not the array itself! (Infuriating, I know.) I got around it by simply encoding the arrays as JSON strings in the page context, then decoding them back into objects when they arrive in Node.js. For example, the above expression would need to be:

JSON.stringify(
    Array.from(document.getElementsByTagName('tr'))
        .map((e) => e.children[2].innerHTML)
        .filter((e) => e.length > 0)
)
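
On the Node.js side, the encode-then-decode round trip could look roughly like the sketch below. The helper names `toJsonExpression` and `evaluateAsJson` are made up for illustration, and `client` is assumed to be an already connected chrome-remote-interface client.

```javascript
// Wrap a synchronous page expression so that the page hands back a JSON string.
function toJsonExpression(expression) {
  return `JSON.stringify(${expression})`;
}

// Evaluate `expression` in the page context and decode the JSON string in Node.js.
// `client` is assumed to be a connected chrome-remote-interface client.
async function evaluateAsJson(client, expression) {
  const { result, exceptionDetails } = await client.Runtime.evaluate({
    expression: toJsonExpression(expression),
  });
  if (exceptionDetails) {
    // The expression threw inside the page.
    throw new Error(`Evaluation failed: ${exceptionDetails.text}`);
  }
  // result.value holds the JSON string produced in the page.
  return JSON.parse(result.value);
}

// Possible usage (inside an async function):
//   const cells = await evaluateAsJson(client,
//     "Array.from(document.getElementsByTagName('tr')).map((e) => e.children[2].innerHTML)");
```

Note that if the page expression evaluates to undefined, JSON.stringify also yields undefined and JSON.parse will throw, so this sketch only suits expressions that return JSON-serializable values.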

-- Navigation --
When you trigger a page load with "Page.navigate", ".click()", ".submit()", "window.location.href=..." or in any other way, it's important to know when the next page has finished loading before sending more instructions with Runtime.evaluate. I handled this by asking the debugger to send me the page loading events (look for the Page.enable method in the documentation) and then waiting for the "Page.loadEventFired" event before sending more expressions.
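
A minimal sketch of that wait, assuming a connected chrome-remote-interface client, where calling an event method such as Page.loadEventFired() without a callback returns a promise for the next occurrence of the event. The helper name `waitForLoadAfter` is hypothetical.

```javascript
// Run `trigger` (anything that causes a page load) and resolve once the
// next Page.loadEventFired event arrives.
// `client` is assumed to be a connected chrome-remote-interface client.
async function waitForLoadAfter(client, trigger) {
  const { Page } = client;
  // Ask the debugger to start sending page events.
  await Page.enable();
  // Start listening before triggering, so the event cannot be missed.
  const loaded = Page.loadEventFired();
  await trigger();
  await loaded;
}

// Possible usage (inside an async function):
//   await waitForLoadAfter(client, () => client.Page.navigate({ url: 'https://example.com' }));
//   await waitForLoadAfter(client, () =>
//     client.Runtime.evaluate({ expression: "document.querySelector('form').submit()" }));
```

Registering the listener before the trigger is a deliberate choice here: if the page loads very quickly, a listener attached afterwards might wait forever.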

JavaScript expressions evaluated by Runtime.evaluate are executed within the page context just like what happens in the DevTools console.

You can interact with the DOM using the DOM domain, e.g., DOM.getDocument, DOM.querySelector, etc.
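
As an illustration of the DOM domain, the sketch below chains DOM.getDocument, DOM.querySelector and DOM.getOuterHTML to fetch the markup of the first element matching a selector. The helper name `outerHtmlOf` is made up, `client` is assumed to be a connected chrome-remote-interface client, and treating a nodeId of 0 as "no match" is my understanding of the protocol rather than something stated above.

```javascript
// Return the outer HTML of the first element matching `selector`,
// or null when nothing matches.
// `client` is assumed to be a connected chrome-remote-interface client.
async function outerHtmlOf(client, selector) {
  const { DOM } = client;
  // The root node is the starting point for queries.
  const { root } = await DOM.getDocument();
  const { nodeId } = await DOM.querySelector({ nodeId: root.nodeId, selector });
  if (!nodeId) {
    // Assumption: a nodeId of 0 means no element matched.
    return null;
  }
  const { outerHTML } = await DOM.getOuterHTML({ nodeId });
  return outerHTML;
}

// Possible usage (inside an async function):
//   const html = await outerHtmlOf(client, 'table tr');
```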

Also remember that chrome-remote-interface is mainly a library, meaning that it lets you write your own Node.js applications; the chrome-remote-interface inspect command is just a utility.
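
To give an idea of what such an application might look like, here is a rough skeleton that connects, runs a task, and always closes the connection. The name `withChrome` is hypothetical, and the sketch assumes the chrome-remote-interface package is installed and Chrome was started with remote debugging enabled on the default port. The second parameter exists only so the connection function can be swapped out, for example in tests.

```javascript
// Connect to Chrome, run `task(client)`, and close the connection afterwards,
// even if the task throws.
// `connect` defaults to the chrome-remote-interface module; the require only
// runs when no replacement is supplied.
async function withChrome(task, connect = require('chrome-remote-interface')) {
  const client = await connect();
  try {
    return await task(client);
  } finally {
    await client.close();
  }
}

// Possible usage:
//   withChrome(async (client) => {
//     const { Page, Runtime } = client;
//     await Page.enable();
//     await Page.navigate({ url: 'https://example.com' });
//     await Page.loadEventFired();
//     const { result } = await Runtime.evaluate({ expression: 'document.title' });
//     console.log(result.value);
//   }).catch(console.error);
```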

There are several places where you can get help:

If you ask something more specific I'd be happy to try to help you with that.

Finally you may want to take a look at automated-chrome-profiling, which I think is structurally similar to what you're trying to achieve.
