scrape websites with infinite scrolling

You can use Selenium to scrape sites with infinite scrolling, such as Twitter or Facebook.

Step 1 : Install Selenium using pip

pip install selenium

Step 2 : Use the code below to automate the infinite scroll and extract the page source

import time
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By


class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"

    def tearDown(self):
        # Close the browser once the test has finished.
        self.driver.quit()

    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        # find_element_by_link_text was removed in Selenium 4; use By.LINK_TEXT.
        driver.find_element(By.LINK_TEXT, "All").click()
        # Scroll to the bottom repeatedly so the page keeps loading more results.
        for i in range(1, 100):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)  # give the new content time to load
        html_source = driver.page_source
        data = html_source.encode('utf-8')


if __name__ == "__main__":
    unittest.main()

Step 3 : Print the data if required.
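
The loop in Step 2 always scrolls 99 times with a 4-second pause, whether or not the page has more content. One possible refinement, sketched below, is to stop as soon as the page height stops growing. The function name scroll_until_stable and its pause and max_scrolls parameters are made up for this example; the only driver method it relies on is execute_script, which the code above already uses. It assumes that a height that does not change after one pause means the end of the feed, which may be wrong on slow connections.

```python
import time


def scroll_until_stable(driver, pause=4.0, max_scrolls=100):
    """Scroll to the bottom until the page height stops growing.

    Hypothetical helper: `driver` is any object with an execute_script()
    method, such as a Selenium WebDriver. Returns the number of scrolls done.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # wait for the site's JavaScript to append content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded; assume we reached the end of the feed.
            break
        last_height = new_height
    return scrolls
```

In the test above, the for loop could then be replaced by a call such as scroll_until_stable(driver) before reading driver.page_source.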

python screen-scraping scraper

Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using this rather than scraping.

But if you must scrape...

Such sites use JavaScript to request additional content from the server when you reach the bottom of the page. All you need to do is figure out the URL of that additional content and you can retrieve it yourself. You can find the required URL by inspecting the script, by using the Firefox Web Console, or by using a debug proxy.

For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you wish to scrape. You'll see all the files as they are loaded. Scroll the page while watching the Web Console and you'll see the URLs being used for the additional requests. Then you can request that URL yourself and see what format the data is in (probably JSON) and get it into your Python script.
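
Once you have found the URL, fetching it from Python could look roughly like the sketch below. It uses only the standard library. Everything specific to the site is an assumption: the "items" and "next_url" keys stand in for whatever fields the real JSON response uses, and collect_items and fetch_json are names invented for this example. Check the actual response in the Web Console and adjust accordingly.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Download one URL and parse the response body as JSON."""
    # Some sites reject requests without a browser-like User-Agent.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def collect_items(first_url, fetch=fetch_json, max_pages=50):
    """Follow the 'load more' requests until no further page is offered.

    Assumes (hypothetically) that each response is a JSON object with an
    "items" list and a "next_url" field pointing at the next chunk.
    """
    items = []
    url = first_url
    for _ in range(max_pages):
        if not url:
            break
        payload = fetch(url)
        items.extend(payload.get("items", []))
        url = payload.get("next_url")
    return items
```

The fetch parameter is there so the paging logic can be tried without touching the network, for example by passing a function that returns canned dictionaries.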


Finding the URL of the AJAX source is the best option, but it can be cumbersome for certain sites. Alternatively, you could use a headless browser like QWebKit from PyQt and send keyboard events while reading the data from the DOM tree. QWebKit has a nice and simple API.
