can we use XPath with BeautifulSoup? can we use XPath with BeautifulSoup? python python

can we use XPath with BeautifulSoup?


Nope, BeautifulSoup, by itself, does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.

Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

try:    # Python 2    from urllib2 import urlopenexcept ImportError:    from urllib.request import urlopenfrom lxml import etreeurl =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"response = urlopen(url)htmlparser = etree.HTMLParser()tree = etree.parse(response, htmlparser)tree.xpath(xpathselector)

There is also a dedicated lxml.html() module with additional functionality.

Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:

import lxml.htmlimport requestsurl =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"response = requests.get(url, stream=True)response.raw.decode_content = Truetree = lxml.html.parse(response.raw)

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelectortd_empformbody = CSSSelector('td.empformbody')for elem in td_empformbody(tree):    # Do something with these table cells.

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):    # Do something with these table cells.


I can confirm that there is no XPath support within Beautiful Soup.


As others have said, BeautifulSoup doesn't have xpath support. There are probably a number of ways to get something from an xpath, including using Selenium. However, here's a solution that works in either Python 2 or 3:

from lxml import htmlimport requestspage = requests.get('http://econpy.pythonanywhere.com/ex/001.html')tree = html.fromstring(page.content)#This will create a list of buyers:buyers = tree.xpath('//div[@title="buyer-name"]/text()')#This will create a list of pricesprices = tree.xpath('//span[@class="item-price"]/text()')print('Buyers: ', buyers)print('Prices: ', prices)

I used this as a reference.