Selenium python: get all the <li> text of all the <ul> from a <div>


Try waiting for the page to load fully before parsing it. One way is to call time.sleep():

    from time import sleep
    ...
    for url in listURL:
        driver.get(url)
        sleep(5)
        ...

EDIT: Using BeautifulSoup:

    import requests
    from bs4 import BeautifulSoup

    listURL = [
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
    ]
    list_text = []

    for url in listURL:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        print("Link:", url)

        # Match ids containing "Lesson" but not "Lessons" (the section heading)
        for tag in soup.select("[id*=Lesson]:not([id*=Lessons])"):
            print(tag.text)
            print()
            # The first <ul> after each lesson heading holds that lesson's words
            print(tag.find_next("ul").text)
            print("-" * 80)
        print()

Output (truncated):

    Link: https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1
    Lesson 1

    man = man
    vrouw = woman
    jongen = boy
    ik = I
    ben = am
    een = a/an
    en = and
    --------------------------------------------------------------------------------
    Lesson 2

    meisje = girl
    kind = child/kid
    hij = he
    ze = she (unstressed)
    is = is
    of = or
    --------------------------------------------------------------------------------
    Lesson 3

    appel = apple
    ... And on

If you want the output as a list:

    for url in listURL:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        print("Link:", url)
        print([tag.text for tag in soup.select(".mw-parser-output > ul li")])
        print("-" * 80)
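To see what "the text of every <li> inside every <ul>" amounts to without a browser or a network call, here is a rough sketch that uses only the standard library's html.parser. It is a stand-in for the BeautifulSoup selector above, not an equivalent: it ignores the enclosing <div> and assumes flat, non-nested lists. SAMPLE_HTML and ListItemCollector are made up for illustration; the sample only borrows a few words from the output above.

```python
from html.parser import HTMLParser

# Hypothetical fragment, loosely shaped like the wiki page (an assumption).
SAMPLE_HTML = """
<div class="mw-parser-output">
  <h3 id="Lesson_1">Lesson 1</h3>
  <ul><li>man = man</li><li>vrouw = woman</li></ul>
  <h3 id="Lesson_2">Lesson 2</h3>
  <ul><li>meisje = girl</li></ul>
  <ol><li>not inside a ul</li></ol>
</div>
"""


class ListItemCollector(HTMLParser):
    """Collect the text of each <li> that appears inside a <ul>."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0    # number of <ul> elements currently open
        self.in_li = False   # True while inside a <li> that belongs to a <ul>
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1
        elif tag == "li" and self.ul_depth > 0:
            self.in_li = True
            self.items.append("")  # start a new item

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth > 0:
            self.ul_depth -= 1
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


parser = ListItemCollector()
parser.feed(SAMPLE_HTML)
list_text = [item.strip() for item in parser.items]
print(list_text)
```

The <li> under the <ol> is skipped because no <ul> is open at that point. For real pages, the BeautifulSoup selectors above are the more practical choice.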


Your script seems to be OK, but I'd add an explicit or implicit wait. Try waiting until all the elements on the page are visible:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument('headless')  # start chrome without opening window
    driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)

    listURL = [
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
    ]
    list_text = []

    for url in listURL:
        driver.get(url)
        # Explicit wait: up to 15 seconds for the <ul> elements to be visible
        WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))
        elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
        for each_ul in elem:
            all_li = each_ul.find_elements_by_tag_name("li")
            for li in all_li:
                list_text.append(li.text)

    print(list_text)

Also, you can add driver.implicitly_wait(15) right after you declare driver.

Output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', 'meisje = girl', 'kind = child/kid', 'hij = he', 'ze = she (unstressed)', 'is = is', 'of = or', 'appel = apple', 'melk = milk', 'drinkt = drinks (2nd and 3rd person singular)', 'drink = drink (1st person singular)', 'eet = eat(s) (singular)', 'de = the', 'sap = juice', 'water = water', 'brood = bread', 'het = it, the', 'je = you (singular informal, unstressed)', 'bent = are (2nd person singular)', 'Zijn (to be)', 'Hebben (to have)', 'Mogen (to be allowed to)', 'Willen (to want)', 'Kunnen (to be able to)', 'Zullen ("will")', 'boterham = sandwich', 'rijst = rice', 'we = we (unstressed)', 'jullie = you (plural informal)', 'eten = eat (plural)', 'drinken = drink (plural)', 'vrouwen = women', 'mannen = men', 'meisjes = girls', 'krant = newspaper', 'lezen = read (plural)', 'jongens = boys', 'menu = menu', 'dat = that', 'zijn = are (plural)', 'ze = they (unstressed)', 'heb = have (1st person singular)', 'heeft = has (3rd person singular)', 'hebt = have (2nd person singular)', 'hebben = have (plural)', 'boek = book', 'lees = read (1st person singular)', 'leest = read(s) (2nd and 3rd person singular)', 'kinderen = children', 'spreken = speak (plural)', 'spreek = speak (1st person singular)', 'spreekt = speak(s) (2nd and 3rd person singular)', 'hallo = hello', 'bedankt = thanks', 'doei = bye', 'dag = goodbye', 'tot ziens = see you later', 'hoi = hi', 'goedemorgen = good morning', 'goededag = good day', 'goedenavond = good evening', 'goedenacht = good night', 'welterusten = good night', 'ja = yes', 'dank je wel = thank you very much', 'alsjeblieft = please', 'sorry = sorry', 'het spijt me = I am sorry', 'oké = okay', 'pardon = excuse me', 'hoe gaat het = how are you', 'goed = good, fine, well', 'dank je = thank you', '(een) beetje = (a) bit of', 'Engels = English', 'Nederlands = Dutch', 'Geen: negating indefinite nouns (you can think of it as "no" things or "none of" a thing if that 
helps). Geen replaces the indefinite pronoun in question.', 'Niet: negating a verb, adjective or definite nouns. Niet comes at the end of a sentence or directly after the verb zijn.', 'nee = no', 'niet = not', 'geen = not']

Update: I found a more reliable way using CSS selectors. Please try it:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument('headless')  # start chrome without opening window
    driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
    driver.implicitly_wait(15)

    listURL = [
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
    ]
    list_text = []

    for url in listURL:
        driver.get(url)
        wait = WebDriverWait(driver, 15)
        # Wait for the ad containers first, then for the content lists
        wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe'] ")))
        wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.mw-parser-output>ul')))
        elem = driver.find_elements_by_css_selector('.mw-parser-output>ul')
        for each_ul in elem:
            all_li = each_ul.find_elements_by_css_selector("li")
            for li in all_li:
                list_text.append(li.text)

    print(list_text)

Update 2: After looking into the cause, I found that the ads take up most of the loading time. So I added wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe'] "))), which waits until the ads have loaded.

I also changed the locator in the second wait to .mw-parser-output>ul by dropping the trailing li, which I don't think is necessary. You can also try removing the second wait altogether and see whether that helps.


After

    WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))

you need to add a short sleep (I'd guess time.sleep(1) is enough), and only then run

    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')

Your problem is caused by a misunderstanding of what visibility_of_all_elements_located does.
It does not wait for every element that will eventually match your locator to become visible; it has no idea how many such elements to expect.
As soon as it finds at least one matching element, and the matches present at that moment are visible, it returns the list of detected elements and the program moves on.
See more details about those methods here and in the official documentation.
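As an alternative to a fixed sleep, one option is to poll until the number of matching elements stops changing. The helper below is a minimal sketch of that idea and is not part of Selenium: the name wait_until_count_stable and its parameters are hypothetical, and it takes any zero-argument callable that returns a count, so it can be tried without a browser.

```python
import time
from typing import Callable, Optional


def wait_until_count_stable(
    get_count: Callable[[], int],
    timeout: float = 15.0,
    poll: float = 0.5,
    stable_polls: int = 3,
) -> int:
    """Poll get_count() until it returns the same non-zero value
    stable_polls times in a row; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    last: Optional[int] = None
    streak = 0
    while time.monotonic() < deadline:
        current = get_count()
        if current > 0 and current == last:
            streak += 1          # same non-zero count as the previous poll
        else:
            streak = 1 if current > 0 else 0   # count changed, start over
        last = current
        if streak >= stable_polls:
            return current
        time.sleep(poll)
    raise TimeoutError(f"count did not stabilise within {timeout} s (last={last})")


# Simulated page where the lists appear gradually (made-up numbers).
fake_counts = iter([0, 1, 3, 5, 5, 5, 5])
stable = wait_until_count_stable(lambda: next(fake_counts), timeout=5, poll=0.01)
print(stable)
```

With Selenium, the callable could be something like lambda: len(driver.find_elements(By.XPATH, '//*[@id="mw-content-text"]/div/ul')), called after the explicit wait. A stable count is still only a heuristic: content that loads after a long pause would be missed.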