
Selenium/BeautifulSoup - Python - Loop Through Multiple Pages


You need to re-parse the page source each time you "click" to the next page. So you'll want that parsing inside your while loop; otherwise you'll just keep iterating over the 1st page, even after it clicks to the next one, because the prod_containers object never changes.

Secondly, the way you have it, your while loop will never stop: you set pageCounter = 0 but never increment it, so it will forever be < your maxPageCount.
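
In skeleton form (using the names from your code, scraping details omitted), the loop needs to look something like this:

pageCounter = 0
while pageCounter < maxPageCount:
    # Parse whatever page the driver is currently showing
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')

    for product in prod_containers:
        ...  # extract name, hyperlink, review count, star rating

    # Move to the next page and advance the counter so the loop can end
    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter += 1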

I fixed those 2 things in the code and ran it, and it appears to have worked and parsed pages 1 through 5.

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []
pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text) + 1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')

while (pageCounter < maxPageCount):
    # Re-parse the current page on every iteration, so each click actually
    # yields new products instead of the 1st page over and over
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')

    for product in prod_containers:
        # If the product has a review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            name = name.strip()
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star rating
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating)

    # Click to the next page and increment the counter so the loop terminates
    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter += 1
    print(pageCounter)
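
One caveat with this approach (not something the code above addresses): driver.page_source is read right after the click, so if the next page is slow to render you may parse the previous page again. A possible guard, assuming the product grid element is replaced when the page changes (which I haven't verified for this site), is to wait for the old grid to go stale before the next iteration:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Inside the while loop, replace the bare click with:
old_grid = driver.find_element_by_class_name('products_grid')
driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
# Wait up to 10 seconds for the old grid element to be detached from the DOM
# before the next iteration re-reads driver.page_source.
WebDriverWait(driver, 10).until(EC.staleness_of(old_grid))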


OK, this snippet of code will not run on its own from a .py file; I'm guessing you were running it in IPython or a similar environment where these variables were already initialized and the libraries already imported.

First off, you need to import the re module for the regular expressions:

import re

Also, all those clear() calls are unnecessary, since you initialize all of those lists anyway (in fact, Python throws a NameError, because the lists haven't been defined yet at the point where you call clear() on them).
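
To illustrate with one of the lists from your code:

# products.clear()   # would raise NameError: name 'products' is not defined
products = []        # initializing like this already gives you an empty list
products.clear()     # legal now, but redundant right after products = []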

Also, you need to initialize counterProduct:

counterProduct = 0

And finally, you have to assign a value to html_soup before referencing it in your code:

html_soup = BeautifulSoup(driver.page_source, 'html.parser')

Here is the corrected code, which works:

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []
pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text) + 1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')
counterProduct = 0

while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has a review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating)

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct += 1
    print(counterProduct)
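
Not part of either answer above, but once the loop finishes, the four parallel lists can be written out with the standard csv module; a minimal sketch (the output file name is just an example):

import csv

# Write the parallel lists collected above out as rows of a CSV file.
with open('kohls_products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['product', 'hyperlink', 'review_count', 'star_rating'])
    writer.writerows(zip(products, hyperlinks, reviewCounts, starRatings))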