
crawl site that has infinite scrolling using python


You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.

Step 1 : Install Selenium using pip

pip install selenium 

Step 2 : Use the code below to automate the infinite scroll and extract the page source

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import unittest


class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"

    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
        driver.find_element(By.LINK_TEXT, "All").click()
        # Scroll to the bottom repeatedly so new results keep loading
        for i in range(1, 100):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')


if __name__ == "__main__":
    unittest.main()

The for loop scrolls through the page repeatedly; once it finishes, you can extract the loaded data from the page source.

Step 3 : Print the data if required.
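As a minimal sketch of that extraction step, the captured page source can be parsed with Python's standard-library HTML parser. The `html_source` snippet and the `tweet-text` class name below are hypothetical stand-ins; in practice `html_source` would come from `driver.page_source` and the class name would match the real markup:

```python
from html.parser import HTMLParser

# A tiny stand-in for the page source captured by the crawler above;
# in practice html_source would come from driver.page_source.
html_source = """
<div class="tweet"><p class="tweet-text">First result</p></div>
<div class="tweet"><p class="tweet-text">Second result</p></div>
"""


class TweetTextParser(HTMLParser):
    """Collects the text of every element whose class list contains 'tweet-text'."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "tweet-text" in classes.split():
            self._capturing = True

    def handle_endtag(self, tag):
        self._capturing = False

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.texts.append(data.strip())


parser = TweetTextParser()
parser.feed(html_source)
for text in parser.texts:
    print(text)
```

A dedicated library like BeautifulSoup makes this easier for real pages, but the stdlib version shows the idea without extra dependencies.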


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.something.com")
lastElement = driver.find_elements(By.ID, "someId")[-1]
lastElement.send_keys(Keys.NULL)

This will open a page, find the bottom-most element with the given id, and then scroll that element into view. You'll have to keep querying the driver for the last element as the page loads more, and I've found this to be pretty slow as pages get large. The time is dominated by the call to driver.find_elements because I don't know of a way to query only the last element on the page.

Through experimentation you might find there is an upper limit to the number of elements the page loads dynamically; if so, it would be best to write something that scrolls until that number is loaded and only then calls driver.find_elements.
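That "scroll until the count stops growing or hits a cap" loop can be sketched as a small helper. The names `scroll_until_count` and `FakePage` are made up for illustration; with a real driver, `get_count` could be `lambda: len(driver.find_elements(By.ID, "someId"))` and `scroll_once` could execute a scroll script:

```python
import time


def scroll_until_count(get_count, scroll_once, target, max_rounds=50, pause=0.0):
    """Scroll repeatedly until get_count() reaches target or stops growing.

    get_count and scroll_once are callables, so the same logic works with a
    real Selenium driver or with a fake used for testing.
    """
    last = get_count()
    for _ in range(max_rounds):
        if last >= target:
            break
        scroll_once()
        time.sleep(pause)
        current = get_count()
        if current == last:  # nothing new loaded: the page is exhausted
            break
        last = current
    return last


# Simulated page that loads 10 more items per scroll, capped at 35
class FakePage:
    def __init__(self):
        self.items = 10

    def count(self):
        return self.items

    def scroll(self):
        self.items = min(self.items + 10, 35)


page = FakePage()
loaded = scroll_until_count(page.count, page.scroll, target=30)
print(loaded)  # → 30
```

Stopping when the count stops growing also handles pages that run out of content before the cap is reached.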


Here is short and simple code that works for me:

import time

from selenium.webdriver.common.by import By

# driver is assumed to be an already-initialized WebDriver
# with the target page loaded

SCROLL_PAUSE_TIME = 20

# Get initial scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

posts = driver.find_elements(By.CLASS_NAME, "post-text")
for block in posts:
    print(block.text)
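A long fixed sleep wastes time when the page loads quickly. One alternative is to poll the scroll height at short intervals and return as soon as it changes. This is a sketch; `wait_for_height_change` is a hypothetical helper name, and with a real driver `get_height` would wrap `driver.execute_script("return document.body.scrollHeight")`:

```python
import time


def wait_for_height_change(get_height, last_height, timeout=10.0, poll=0.25):
    """Poll get_height() until it differs from last_height or timeout expires.

    Returns the new height as soon as the page grows, or the unchanged
    height once the timeout passes. get_height is a callable so it can
    wrap a Selenium execute_script call or a test double.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        height = get_height()
        if height != last_height:
            return height
        time.sleep(poll)
    return last_height


# Simulated page whose height grows once, then stays fixed
heights = iter([1000, 1000, 2400, 2400])
new = wait_for_height_change(lambda: next(heights),
                             last_height=1000, timeout=2.0, poll=0.01)
print(new)  # → 2400
```

Dropping this in place of time.sleep(SCROLL_PAUSE_TIME) keeps the loop above unchanged while cutting the per-scroll wait to only as long as the page actually needs.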