Can we use XPath with BeautifulSoup?
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode where it will try to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.
Once you've parsed your document into an lxml tree, you can use the `.xpath()` method to search for elements.
```python
try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    # Python 3
    from urllib.request import urlopen

from lxml import etree

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)  # xpathselector holds your XPath expression
```
There is also a dedicated `lxml.html` module with additional functionality.
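For a quick taste of that extra functionality, here's a small sketch of my own (the markup is made up for illustration): `lxml.html` understands HTML-specific concepts such as rewriting relative links against a base URL, which plain `lxml.etree` does not.

```python
import lxml.html

# Parse an HTML fragment, then rewrite its relative links
# against a base URL using an lxml.html-only helper.
doc = lxml.html.fromstring('<p>See the <a href="/about">about page</a>.</p>')
doc.make_links_absolute('http://www.example.com/')
print(doc.xpath('//a/@href'))  # ['http://www.example.com/about']
```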
Note that in the above example I passed the `response` object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the `requests` library, you want to set `stream=True` and pass in the `response.raw` object after enabling transparent transport decompression:
```python
import lxml.html
import requests

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True  # enable transparent decompression
tree = lxml.html.parse(response.raw)
```
Of possible interest to you is the CSS Selector support; the `CSSSelector` class translates CSS statements into XPath expressions, making your search for `td.empformbody` that much easier:
```python
from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    pass
```
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
```python
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    pass
```
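As a self-contained sketch of the same idea (the table markup here is invented purely to match the selector above):

```python
from bs4 import BeautifulSoup

# Made-up HTML matching the 'table#foobar td.empformbody' selector.
html = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
names = [cell.get_text() for cell in soup.select('table#foobar td.empformbody')]
print(names)  # ['Alice', 'Bob']
```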
As others have said, BeautifulSoup doesn't have XPath support. There are probably a number of ways to get something from an XPath, including using Selenium. However, here's a solution that works in either Python 2 or 3:
```python
from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)

# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)
```
I used this as a reference.