scrape html generated by javascript with python

javascript python browser screen-scraping

In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.

You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.

javascript python browser screen-scraping

Since there is no comprehensive answer here, I'll go ahead and write one.

To scrape off JS rendered pages, we will need a browser that has a JavaScript engine (e.i, support JavaScript rendering)

Options like Mechanize, url2lib will not work since they DO NOT support JavaScript.

So here's what you do:

Setup PhantomJS to run with Selenium. After installing the dependencies for both of them (refer this), you can use the following code as an example to fetch the fully rendered website.

from selenium import webdriverdriver = webdriver.PhantomJS()driver.get('http://jokes.cc.com/')soupFromJokesCC = BeautifulSoup(driver.page_source) #page_source fetches page after rendering is completedriver.save_screenshot('screen.png') # save a screenshot to diskdriver.quit()

javascript python browser screen-scraping

I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.

This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.

CodeHunter

scrape html generated by javascript with python

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last