Scraping hidden text of hotel reviews
If Selenium is not necessary, you can try using requests with BeautifulSoup instead.
import requests
from bs4 import BeautifulSoup

url = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Each review body is a <p> element with lang="en"
reviews = soup.find_all('p', attrs={'lang': 'en'})
for review in reviews:
    print(review.text)
To collect the reviews from all pages, try:
import requests
from bs4 import BeautifulSoup

url = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'
while url:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    reviews = soup.find_all('p', attrs={'lang': 'en'})
    for review in reviews:
        print(review.text)
    # Follow the "next" link until there is none
    next_page = soup.find('a', {'class': 'next'})
    if next_page:
        url = next_page['href']
    else:
        url = None
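The pagination loop above assigns the `href` of the next link directly to `url`. If the site returns a relative `href`, `requests.get` would reject it, so it may be safer to resolve it against the current page URL first. Here is a minimal sketch using the standard library's `urljoin`. The helper name `resolve_next_url` and the `?start=20` query string are assumptions made for illustration, not something taken from the actual page:

```python
from urllib.parse import urljoin


def resolve_next_url(current_url, href):
    """Return an absolute URL for the next page, or None if there is no link.

    `resolve_next_url` is a hypothetical helper, not part of requests or bs4.
    """
    if not href:
        return None
    # urljoin leaves absolute hrefs untouched and resolves relative ones
    return urljoin(current_url, href)


base = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'

# '?start=20' is an assumed pagination parameter, used only as sample input
relative = resolve_next_url(base, '/biz/fairmont-san-francisco-san-francisco?start=20')
absolute = resolve_next_url(base, 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?start=20')
missing = resolve_next_url(base, None)

print(relative)
print(absolute)
print(missing)
```

In the loop, this would replace the `if next_page: ... else: ...` branch with `url = resolve_next_url(url, next_page['href'] if next_page else None)`.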
It seems to work with BeautifulSoup. I used Selenium to get the page source; see the code:
from selenium import webdriver
from bs4 import BeautifulSoup

u = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'
# Selenium 3 style; Selenium 4 passes the driver path through a Service object instead
driver = webdriver.Chrome(executable_path=r'C:\chromedriver_win32\chromedriver.exe')  # , options=options)
driver.get(u)

# Parse the rendered page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
reviews = soup.find_all('p', attrs={'lang': 'en'})
for review in reviews:
    print(review.text)
Your XPath is not finding the element. If you print the length of the returned list, you get zero.
Try this:
# Selenium 3 API; in Selenium 4 use driver.find_elements(By.XPATH, ...)
p = driver.find_elements_by_xpath("//div[@class='review-list']/ul/li//p[@lang='en']")
print(len(p))
for i in p:
    print(i.text)
You can test your XPath or CSS selector in Chrome DevTools.
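The same kind of check can also be sketched outside the browser. The example below runs the XPath from this answer against a small hand-written snippet with the standard library's `xml.etree.ElementTree`. The markup in `SAMPLE` is invented for illustration and is not Yelp's real HTML. ElementTree supports only a subset of XPath and needs well-formed markup, so the expression is written relative to the root (`.//`) and this is a rough sanity check rather than a replacement for testing against the live page:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the review list markup
SAMPLE = """
<html><body>
  <div class="review-list">
    <ul>
      <li><div><p lang="en">Great stay.</p></div></li>
      <li><div><p lang="de">Sehr gut.</p></div></li>
      <li><div><p lang="en">Lovely lobby.</p></div></li>
    </ul>
  </div>
</body></html>
""".strip()

XPATH = ".//div[@class='review-list']/ul/li//p[@lang='en']"


def match_texts(markup, xpath):
    """Return the text of every element in `markup` matched by `xpath`."""
    root = ET.fromstring(markup)
    return [el.text for el in root.findall(xpath)]


texts = match_texts(SAMPLE, XPATH)
print(len(texts))  # a length of zero means the expression matched nothing
for text in texts:
    print(text)
```

An expression that does not fit the structure, such as one with the wrong class name, returns an empty list here, which mirrors the zero-length result described above.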