
Selenium/BeautifulSoup - Python - Loop Through Multiple Pages


You need to re-parse the page source each time you "click" to the next page. So you'll want that parsing inside your while loop; otherwise you'll just keep iterating over the 1st page, even after it clicks to the next one, because the prod_containers object never changes.

Secondly, the way you have it, your while loop will never stop: you set pageCounter = 0 but never increment it, so it will forever be < your maxPageCount.
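
In skeleton form (using the names from your code, scraping details omitted), the loop needs to look something like this:

pageCounter = 0
while pageCounter < maxPageCount:
    # Parse whatever page the driver is currently showing
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')

    for product in prod_containers:
        ...  # extract name, hyperlink, review count, star rating

    # Move to the next page and advance the counter so the loop can end
    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter += 1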

I fixed those 2 things in the code and ran it, and it appears to have worked and parsed pages 1 through 5.

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []
pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text) + 1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')

while (pageCounter < maxPageCount):
    # Re-parse the current page on every iteration, so each click actually
    # yields new products instead of the 1st page over and over
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')

    for product in prod_containers:
        # If the product has a review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            name = name.strip()
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star rating
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating)

    # Click to the next page and increment the counter so the loop terminates
    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter += 1
    print(pageCounter)
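
One caveat with this approach (not something the code above addresses): driver.page_source is read right after the click, so if the next page is slow to render you may parse the previous page again. A possible guard, assuming the product grid element is replaced when the page changes (which I haven't verified for this site), is to wait for the old grid to go stale before the next iteration:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Inside the while loop, replace the bare click with:
old_grid = driver.find_element_by_class_name('products_grid')
driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
# Wait up to 10 seconds for the old grid element to be detached from the DOM
# before the next iteration re-reads driver.page_source.
WebDriverWait(driver, 10).until(EC.staleness_of(old_grid))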


OK, this snippet of code will not run on its own from a .py file; I'm guessing you were running it in IPython or a similar environment where these variables were already initialized and the libraries already imported.

First off, you need to import the re module for the regular expressions:

import re

Also, all those clear() calls are unnecessary, since you initialize all of those lists anyway (in fact, Python throws a NameError, because the lists haven't been defined yet at the point where you call clear() on them).
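
To illustrate with one of the lists from your code:

# products.clear()   # would raise NameError: name 'products' is not defined
products = []        # initializing like this already gives you an empty list
products.clear()     # legal now, but redundant right after products = []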

Also, you need to initialize counterProduct:

counterProduct = 0

And finally, you have to assign a value to html_soup before referencing it in your code:

html_soup = BeautifulSoup(driver.page_source, 'html.parser')

Here is the corrected code, which works:

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []
pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text) + 1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')
counterProduct = 0

while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has a review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating)

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct += 1
    print(counterProduct)
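
Not part of either answer above, but once the loop finishes, the four parallel lists can be written out with the standard csv module; a minimal sketch (the output file name is just an example):

import csv

# Write the parallel lists collected above out as rows of a CSV file.
with open('kohls_products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['product', 'hyperlink', 'review_count', 'star_rating'])
    writer.writerows(zip(products, hyperlinks, reviewCounts, starRatings))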