Selenium/python: extract text from a dynamically-loading webpage after every scroll

You can store the number of messages in a variable and use XPath with position() to fetch only the newly added posts:

import time

# Assumes `driver` is an initialized WebDriver and `ScrollNumber` is the
# number of scrolls you want to perform.
dates = []
messages = []
num_of_posts = 0
for i in range(1, ScrollNumber):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    # position() is 1-indexed, so position() > num_of_posts selects only the
    # posts added since the previous scroll
    dates.extend(driver.find_elements_by_xpath(
        '(//div[@class="message-date"])[position()>' + str(num_of_posts) + ']'))
    messages.extend(driver.find_elements_by_xpath(
        '(//div[contains(@class, "message-body")])[position()>' + str(num_of_posts) + ']'))
    num_of_posts = len(dates)
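
Note that dates and messages collect WebElement objects rather than strings. To get the text itself, a small follow-up (assuming the elements are still attached to the DOM when you read them):

date_texts = [d.text for d in dates]
message_texts = [m.text for m in messages]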


I had the same issue with Facebook posts. To handle it, I save the post ID (or any value that's unique to the post, even a hash) in a list, and when you run the query again, you check whether that ID is already in your list.

Also, you can remove the DOM nodes that have already been parsed, so only the new ones exist.
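
A minimal sketch of that idea (the div.post selector and data-post-id attribute here are hypothetical; adapt them to the page you are scraping):

seen_ids = set()
new_posts = []
for post in driver.find_elements_by_css_selector('div.post'):
    post_id = post.get_attribute('data-post-id')  # hypothetical unique attribute
    if post_id in seen_ids:
        continue
    seen_ids.add(post_id)
    new_posts.append(post.text)
    # Optionally remove the node once parsed, so later queries only see new posts
    driver.execute_script("arguments[0].remove();", post)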


As others have said, if you can do what you need by hitting the API directly, that's your best bet. If you absolutely must use Selenium, see my solution below.
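
For illustration, StockTwits has a public symbol-stream endpoint; a request against it might look like the sketch below (the endpoint path and response fields are assumptions, so verify them against the current API documentation):

import requests

# Assumed endpoint; check the StockTwits API docs and rate limits before relying on it
resp = requests.get("https://api.stocktwits.com/api/2/streams/symbol/USDJPY.json")
resp.raise_for_status()
for message in resp.json().get("messages", []):
    print(message["created_at"], message["body"][:34])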

I do something similar to the following for my own needs.

  • I'm leveraging the :nth-child() feature of CSS selectors to find each element individually as it loads.
  • I'm also using Selenium's explicit-wait functionality (via the explicit package, pip install explicit) to wait for elements efficiently; a rough plain-Selenium equivalent is sketched after this list.
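
For context, waiter.find_element from the explicit package is roughly equivalent to a plain Selenium explicit wait. A sketch using only the standard selenium API (not explicit's internals):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

def find_element_when_ready(driver, css, timeout=30):
    # Roughly what waiter.find_element(driver, css, CSS) does: poll until the
    # element is present, then return it (or raise TimeoutException)
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css)))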

The script is quite fast (it makes no calls to sleep()); however, the webpage itself has so much junk going on in the background that it often takes a while for Selenium to return control to the script.

from __future__ import print_function

from itertools import count

from explicit import waiter, CSS
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait as Wait


# The CSS selectors we will use
POSTS_BASE_CSS = 'ol.stream-list > li'              # All li elements
POST_BASE_CSS = POSTS_BASE_CSS + ":nth-child({0})"  # li child element at index {0}
POST_DATE_CSS = POST_BASE_CSS + ' div.message-date'  # div.message-date inside the li at {0}
POST_BODY_CSS = POST_BASE_CSS + ' div.message-body'  # div.message-body inside the li at {0}


class Post(object):
    def __init__(self, driver, post_index):
        self.driver = driver
        self.date_css = POST_DATE_CSS.format(post_index)
        self.text_css = POST_BODY_CSS.format(post_index)

    @property
    def date(self):
        return waiter.find_element(self.driver, self.date_css, CSS).text

    @property
    def text(self):
        return waiter.find_element(self.driver, self.text_css, CSS).text


def get_posts(driver, url, max_screen_scrolls):
    """ Post object generator """
    driver.get(url)
    screen_scroll_count = 0

    # Wait for the initial posts to load
    waiter.find_elements(driver, POSTS_BASE_CSS, CSS)

    for index in count(1):
        # If there is no element at this index, we need to scroll the screen
        # (or exit the generator once we hit the scroll limit)
        if len(driver.find_elements_by_css_selector(
                'ol.stream-list > :nth-child({0})'.format(index))) == 0:
            if screen_scroll_count >= max_screen_scrolls:
                # Break if we have already done the max scrolls
                break

            # Get the count of total posts on the page
            post_count = len(waiter.find_elements(driver, POSTS_BASE_CSS, CSS))

            # Scroll down
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            screen_scroll_count += 1

            def posts_load(driver):
                """ Custom explicit wait function; waits for more posts to load in """
                return len(waiter.find_elements(driver, POSTS_BASE_CSS, CSS)) > post_count

            # Wait until new posts load in
            Wait(driver, 20).until(posts_load)

        # The list elements have sponsored ads and scripts mixed in with the
        # posts we want to scrape. Check whether they have a div.message-date
        # element and continue on if not
        includes_date_css = POST_DATE_CSS.format(index)
        if len(driver.find_elements_by_css_selector(includes_date_css)) == 0:
            continue

        yield Post(driver, index)


def main():
    url = "https://stocktwits.com/symbol/USDJPY?q=%24USDjpy"
    max_screen_scrolls = 4
    driver = webdriver.Chrome()
    try:
        for post_num, post in enumerate(get_posts(driver, url, max_screen_scrolls), 1):
            print("*" * 40)
            print("Post #{0}".format(post_num))
            print("\nDate: {0}".format(post.date))
            print("Text: {0}\n".format(post.text[:34]))
    finally:
        driver.quit()  # Use try/finally to make sure the driver is closed


if __name__ == "__main__":
    main()

Full disclosure: I'm the creator of the explicit package. You could easily rewrite the above using explicit waits directly, at the expense of readability.