
crawl site that has infinite scrolling using python


You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.

Step 1 : Install Selenium using pip

pip install selenium 

Step 2 : Use the code below to automate the infinite scroll and extract the page source

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import unittest


class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"

    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
        driver.find_element(By.LINK_TEXT, "All").click()
        # Scroll to the bottom repeatedly so new results keep loading
        for i in range(1, 100):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')


if __name__ == "__main__":
    unittest.main()

The for loop scrolls through the page repeatedly; once it finishes, you can extract the loaded data from the page source.

Step 3 : Print the data if required.
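As a minimal sketch of that extraction step, the captured page source can be parsed with Python's standard-library HTML parser. The `html_source` snippet and the `tweet-text` class name below are hypothetical stand-ins; in practice `html_source` would come from `driver.page_source` and the class name would match the real markup:

```python
from html.parser import HTMLParser

# A tiny stand-in for the page source captured by the crawler above;
# in practice html_source would come from driver.page_source.
html_source = """
<div class="tweet"><p class="tweet-text">First result</p></div>
<div class="tweet"><p class="tweet-text">Second result</p></div>
"""


class TweetTextParser(HTMLParser):
    """Collects the text of every element whose class list contains 'tweet-text'."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "tweet-text" in classes.split():
            self._capturing = True

    def handle_endtag(self, tag):
        self._capturing = False

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.texts.append(data.strip())


parser = TweetTextParser()
parser.feed(html_source)
for text in parser.texts:
    print(text)
```

A dedicated library like BeautifulSoup makes this easier for real pages, but the stdlib version shows the idea without extra dependencies.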


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.something.com")
lastElement = driver.find_elements(By.ID, "someId")[-1]
lastElement.send_keys(Keys.NULL)

This will open a page, find the bottom-most element with the given id, and then scroll that element into view. You'll have to keep querying the driver for the last element as the page loads more, and I've found this to be pretty slow as pages get large. The time is dominated by the call to driver.find_elements because I don't know of a way to query only the last element on the page.

Through experimentation you might find there is an upper limit to the number of elements the page loads dynamically; if so, it would be best to write something that scrolls until that number is loaded and only then calls driver.find_elements.
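That "scroll until the count stops growing or hits a cap" loop can be sketched as a small helper. The names `scroll_until_count` and `FakePage` are made up for illustration; with a real driver, `get_count` could be `lambda: len(driver.find_elements(By.ID, "someId"))` and `scroll_once` could execute a scroll script:

```python
import time


def scroll_until_count(get_count, scroll_once, target, max_rounds=50, pause=0.0):
    """Scroll repeatedly until get_count() reaches target or stops growing.

    get_count and scroll_once are callables, so the same logic works with a
    real Selenium driver or with a fake used for testing.
    """
    last = get_count()
    for _ in range(max_rounds):
        if last >= target:
            break
        scroll_once()
        time.sleep(pause)
        current = get_count()
        if current == last:  # nothing new loaded: the page is exhausted
            break
        last = current
    return last


# Simulated page that loads 10 more items per scroll, capped at 35
class FakePage:
    def __init__(self):
        self.items = 10

    def count(self):
        return self.items

    def scroll(self):
        self.items = min(self.items + 10, 35)


page = FakePage()
loaded = scroll_until_count(page.count, page.scroll, target=30)
print(loaded)  # → 30
```

Stopping when the count stops growing also handles pages that run out of content before the cap is reached.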


Here is short and simple code that works for me:

import time

from selenium.webdriver.common.by import By

# driver is assumed to be an already-initialized WebDriver
# with the target page loaded

SCROLL_PAUSE_TIME = 20

# Get initial scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

posts = driver.find_elements(By.CLASS_NAME, "post-text")
for block in posts:
    print(block.text)
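A long fixed sleep wastes time when the page loads quickly. One alternative is to poll the scroll height at short intervals and return as soon as it changes. This is a sketch; `wait_for_height_change` is a hypothetical helper name, and with a real driver `get_height` would wrap `driver.execute_script("return document.body.scrollHeight")`:

```python
import time


def wait_for_height_change(get_height, last_height, timeout=10.0, poll=0.25):
    """Poll get_height() until it differs from last_height or timeout expires.

    Returns the new height as soon as the page grows, or the unchanged
    height once the timeout passes. get_height is a callable so it can
    wrap a Selenium execute_script call or a test double.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        height = get_height()
        if height != last_height:
            return height
        time.sleep(poll)
    return last_height


# Simulated page whose height grows once, then stays fixed
heights = iter([1000, 1000, 2400, 2400])
new = wait_for_height_change(lambda: next(heights),
                             last_height=1000, timeout=2.0, poll=0.01)
print(new)  # → 2400
```

Dropping this in place of time.sleep(SCROLL_PAUSE_TIME) keeps the loop above unchanged while cutting the per-scroll wait to only as long as the page actually needs.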