Using BeautifulSoup to find an HTML tag that contains certain text


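    # BeautifulSoup 3 import and Python 2 print syntax, as in the original answer.
    from BeautifulSoup import BeautifulSoup
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    # text= matches text nodes, so step up to the enclosing tag with .parent.
    for elem in soup(text=re.compile(r' #\S{11}')):
        print elem.parent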

Prints:

    <h2>this is cool #12345678901</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>


When text= is used as a search criterion, BeautifulSoup search operations return BeautifulSoup.NavigableString objects (or a list of them) rather than the BeautifulSoup.Tag objects returned in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is preferred over previous because of changes in BS4.

    from BeautifulSoup import BeautifulSoup
    from pprint import pprint
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    # Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
    pattern = re.compile(r'cool')

    pprint(soup.find(text=pattern).__dict__)
    #>> {'next': u'\n',
    #>>  'nextSibling': None,
    #>>  'parent': <h2>this is cool #12345678901</h2>,
    #>>  'previous': <h2>this is cool #12345678901</h2>,
    #>>  'previousSibling': None}

    print soup.find('h2')
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern)
    #>> this is cool #12345678901
    print soup.find('h2', text=pattern).parent
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern) == soup.find('h2')
    #>> False
    print soup.find('h2', text=pattern) == soup.find('h2').text
    #>> True
    print soup.find('h2', text=pattern).parent == soup.find('h2')
    #>> True
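
For readers on Python 3, the same distinction between text nodes and tags can be checked roughly as follows. This is a minimal sketch, not part of the original answer: it assumes Beautiful Soup 4 (the `bs4` package, version 4.4.0 or later) and uses `string=`, the newer name for `text=`. The variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

# Assumption: the built-in "html.parser" backend, so no extra parser is needed.
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"cool")

# Searching by string alone yields the text node, not the tag around it.
text_node = soup.find(string=pattern)
print(isinstance(text_node, NavigableString))

# The enclosing tag is reachable through .parent.
enclosing = text_node.parent
print(isinstance(enclosing, Tag), enclosing.name)

# Searching by tag name yields the Tag directly; it is the same tree node.
first_h2 = soup.find("h2")
print(enclosing is first_h2)
```

The point of the sketch is only that a string-only search hands back a `NavigableString`, and `.parent` is the step that recovers the tag.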


With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>")
    soup('h2', text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
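
One caveat may be worth sketching. In Beautiful Soup 4, combining a tag name with `string=` (or the older `text=`) compares the pattern against the tag's `.string`, which is `None` when the tag has more than one child, so a heading that contains nested markup is skipped. A possible workaround, offered here as an assumption-laden sketch and not as the answerers' method, is a filter function over `get_text()`. The helper name `has_code` and the sample HTML are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: the second heading wraps part of its text in <b>.
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is <b>bold</b> #12345678901</h2>
<h2>this is nothing</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Name + string filter: only tags with a single text child can match.
by_string = soup.find_all("h2", string=pattern)


def has_code(tag):
    """Hypothetical filter: an <h2> whose combined text matches the pattern."""
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None


# Function filter: matches on the full text, nested tags included.
by_function = soup.find_all(has_code)

print(len(by_string), len(by_function))
```

Whether the stricter or the looser match is wanted depends on the page being scraped; the function filter simply trades the `.string` check for a search over all of the tag's text.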