Using BeautifulSoup to find an HTML tag that contains certain text
    # BeautifulSoup 3, Python 2
    from BeautifulSoup import BeautifulSoup
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    # With text=, each match is a NavigableString; .parent is the enclosing tag.
    for elem in soup(text=re.compile(r' #\S{11}')):
        print elem.parent
Prints:
    <h2>this is cool #12345678901</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
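A note on the pattern itself: BeautifulSoup applies a compiled regex with search semantics, so `\S{11}` means "at least 11 non-whitespace characters here", not "exactly 11". The sketch below uses only the standard `re` module to show the difference. The `strict` pattern is a hypothetical variant, not part of the original answer, and it assumes the IDs are purely numeric.

```python
import re

# Pattern from the answer: " #" followed by 11 non-whitespace characters.
loose = re.compile(r" #\S{11}")

# Hypothetical stricter variant (an assumption, not from the answer):
# exactly 11 digits, followed by whitespace or the end of the string.
strict = re.compile(r" #\d{11}(?!\S)")

samples = [
    "this is cool #12345678901",  # 11-digit ID
    "this is nothing",            # no ID at all
    "foo #126666678901",          # 12-digit ID
]

for text in samples:
    print(text, bool(loose.search(text)), bool(strict.search(text)))
```

The 12-digit ID satisfies the loose pattern because its first 11 characters already match. If only 11-digit IDs should count, something like the stricter variant may be a better fit.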
BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is favored over previous because of changes in BS4.
    from BeautifulSoup import BeautifulSoup
    from pprint import pprint
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    # Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
    pattern = re.compile(r'cool')

    pprint(soup.find(text=pattern).__dict__)
    #>> {'next': u'\n',
    #>>  'nextSibling': None,
    #>>  'parent': <h2>this is cool #12345678901</h2>,
    #>>  'previous': <h2>this is cool #12345678901</h2>,
    #>>  'previousSibling': None}

    print soup.find('h2')
    #>> <h2>this is cool #12345678901</h2>

    print soup.find('h2', text=pattern)
    #>> this is cool #12345678901

    print soup.find('h2', text=pattern).parent
    #>> <h2>this is cool #12345678901</h2>

    print soup.find('h2', text=pattern) == soup.find('h2')
    #>> False

    print soup.find('h2', text=pattern) == soup.find('h2').text
    #>> True

    print soup.find('h2', text=pattern).parent == soup.find('h2')
    #>> True
With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:
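    import re
    from bs4 import BeautifulSoup

    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
    soup('h2', text=re.compile(r' #\S{11}'))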
returns [<h2> this is cool #12345678901 </h2>].
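For anyone on Python 3, here is a rough bs4 sketch of both behaviours side by side. It assumes bs4 4.4.0 or later, where `string=` is the preferred spelling of the older `text=` argument, and it uses the built-in `html.parser`. The names `parents` and `h2_tags` are just illustrative.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Without a tag name, the matches are NavigableString objects,
# so .parent is needed to get back to the enclosing tag.
parents = [s.parent for s in soup.find_all(string=pattern)]

# With a tag name, bs4 returns the matching Tag objects directly.
h2_tags = soup.find_all("h2", string=pattern)

print([tag.name for tag in parents])
print([tag.get_text() for tag in h2_tags])
```

Searching by string alone picks up the h1 as well, since nothing restricts the tag name. Passing "h2" narrows the result to h2 tags and removes the need for .parent.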