Using BeautifulSoup to find a HTML tag that contains certain text

python regex beautifulsoup html-content-extraction

    from BeautifulSoup import BeautifulSoup
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    for elem in soup(text=re.compile(r' #\S{11}')):
        print elem.parent

Prints:

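    <h2>this is cool #12345678901</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>

The search is not restricted to a tag name, so the h1 whose text matches the pattern is printed as well.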

BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see which attributes are available to you. Of these attributes, parent is preferred over previous because of changes in BS4.

    from BeautifulSoup import BeautifulSoup
    from pprint import pprint
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    # Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
    pattern = re.compile(r'cool')

    pprint(soup.find(text=pattern).__dict__)
    #>> {'next': u'\n',
    #>>  'nextSibling': None,
    #>>  'parent': <h2>this is cool #12345678901</h2>,
    #>>  'previous': <h2>this is cool #12345678901</h2>,
    #>>  'previousSibling': None}

    print soup.find('h2')
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern)
    #>> this is cool #12345678901
    print soup.find('h2', text=pattern).parent
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern) == soup.find('h2')
    #>> False
    print soup.find('h2', text=pattern) == soup.find('h2').text
    #>> True
    print soup.find('h2', text=pattern).parent == soup.find('h2')
    #>> True
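
For anyone on Python 3, the same type distinction can be checked with bs4. This is only a minimal sketch, assuming bs4 is installed and using the built-in html.parser; the variable names text_node and first_h2 are mine, not from the answer above.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html_text = """<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"cool")

# A search on the string alone yields the text node, not the tag.
text_node = soup.find(string=pattern)
# A search by tag name yields the tag itself.
first_h2 = soup.find("h2")

print(type(text_node).__name__)      # NavigableString
print(type(first_h2).__name__)       # Tag
# Stepping up from the text node lands on the enclosing tag.
print(text_node.parent is first_h2)  # True
```

So whenever a search hands back a string, .parent is the way to get at the tag that contains it.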

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
    soup('h2', text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
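
To tie the answers together, here is a rough sketch of both search styles under bs4 on the longer sample document. It assumes bs4 4.4.0 or later, where string= is the newer name for the text= argument, and the built-in html.parser. The helper find_tags_with_text is a hypothetical name made up for this illustration, not part of the library.

```python
import re

from bs4 import BeautifulSoup

HTML_TEXT = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

PATTERN = re.compile(r" #\S{11}")


def find_tags_with_text(soup, pattern, name=None):
    """Hypothetical helper: return the tags whose text matches pattern."""
    if name is None:
        # A string-only search returns text nodes, so step up to each parent tag.
        return [node.parent for node in soup.find_all(string=pattern)]
    # With a tag name given, bs4 returns the matching tags themselves.
    return list(soup.find_all(name, string=pattern))


soup = BeautifulSoup(HTML_TEXT, "html.parser")
any_tag = find_tags_with_text(soup, PATTERN)
h2_only = find_tags_with_text(soup, PATTERN, "h2")

print([tag.name for tag in any_tag])  # ['h2', 'h1', 'h2', 'h2']
print([tag.name for tag in h2_only])  # ['h2', 'h2', 'h2']
```

Passing the tag name is what keeps the h1 out of the results; without it, any tag whose text matches is returned.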
