How can I parse a website using Selenium and BeautifulSoup in Python? [closed]


Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:

    In [8]: from bs4 import BeautifulSoup

    In [9]: from selenium import webdriver

    In [10]: driver = webdriver.Firefox()

    In [11]: driver.get('http://news.ycombinator.com')

    In [12]: html = driver.page_source

    In [13]: soup = BeautifulSoup(html, 'html.parser')

    In [14]: for tag in soup.find_all('title'):
       ....:     print(tag.text)
       ....:
    Hacker News
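
The same idea can be sketched as a small script, assuming Python 3 and a recent bs4. The parsing lives in a function that takes only an HTML string, so it can be tried without launching a browser. The names `titles_from_html` and `fetch_page_source` are made up for illustration, and the explicit `"html.parser"` argument and the `driver.quit()` cleanup are additions that are not part of the session above.

```python
from bs4 import BeautifulSoup


def titles_from_html(html):
    """Return the text of every <title> tag found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("title")]


def fetch_page_source(url):
    """Load url in Firefox and return the HTML Selenium sees (hypothetical helper)."""
    # Imported lazily so titles_from_html() works even without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # close the browser even if get() raises


# Usage (needs Firefox and geckodriver on the machine):
#   print(titles_from_html(fetch_page_source("http://news.ycombinator.com")))
```

Keeping the fetch and the parse separate also makes it easy to save `page_source` to a file once and experiment with BeautifulSoup offline.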


As your question isn't particularly concrete, here's a simple example. To do something more useful, read the BS docs. You will also find plenty of examples of Selenium (and BS) usage here on SO.

    from selenium import webdriver
    from bs4 import BeautifulSoup

    browser = webdriver.Firefox()
    browser.get('http://webpage.com')
    soup = BeautifulSoup(browser.page_source, 'html.parser')

    # do something useful
    # prints all the links with corresponding text
    for link in soup.find_all('a'):
        print(link.get('href', None), link.get_text())
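
The href values collected this way are often relative paths. As a rough sketch (Python 3 assumed), the link loop can be wrapped in a function that resolves each href against the page URL with the standard library's `urljoin`. The function name `extract_links` and the sample HTML are invented for illustration, and skipping anchors without an href is a choice made here, not something the example above does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that carry no href attribute at all.
    for a in soup.find_all("a", href=True):
        links.append((urljoin(base_url, a["href"]), a.get_text(strip=True)))
    return links


# Made-up sample markup standing in for browser.page_source.
sample = (
    '<a href="/about">About</a> '
    '<a name="top">no href</a> '
    '<a href="http://example.org/x">X</a>'
)
for url, text in extract_links(sample, "http://webpage.com/index.html"):
    print(url, text)
```

With a live Selenium session the call would look like `extract_links(browser.page_source, browser.current_url)`.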


Are you sure you want to use Selenium? For this purpose I used PyQt4; it's very powerful, and you can do whatever you want.

Here is some sample code that I just wrote; just change the URL and you're good to go:

    #! /usr/bin/env python2.7
    from PyQt4.QtCore import *
    from PyQt4.QtGui import *
    from PyQt4.QtWebKit import *
    from bs4 import BeautifulSoup
    import sys, signal

    class Browser(QWebView):
        def __init__(self):
            QWebView.__init__(self)
            self.loadProgress.connect(self._progress)
            self.loadFinished.connect(self._loadFinished)
            self.frame = self.page().currentFrame()

        def _progress(self, progress):
            print str(progress) + "%"

        def _loadFinished(self):
            print "Load Finished"
            html = unicode(self.frame.toHtml()).encode('utf-8')
            soup = BeautifulSoup(html)
            print soup.prettify()
            self.close()

    if __name__ == "__main__":
        app = QApplication(sys.argv)
        br = Browser()
        url = QUrl('http://web site that can contain javascript.com')
        br.load(url)
        br.show()
        if signal.signal(signal.SIGINT, signal.SIG_DFL):
            sys.exit(app.exec_())
        app.exec_()