Python code to remove HTML tags from a string [duplicate]

Using a regex

Using a regex, you can clean everything inside <> :

import re# as per recommendation from @freylis, compile once onlyCLEANR = re.compile('<.*?>') def cleanhtml(raw_html):  cleantext = re.sub(CLEANR, '', raw_html)  return cleantext

Some HTML texts can also contain entities that are not enclosed in brackets, such as '&nsbm'. If that is the case, then you might want to write the regex as

CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

This link contains more details on this.

Using BeautifulSoup

You could also use BeautifulSoup additional package to find out all the raw text.

You will need to explicitly set a parser when calling BeautifulSoupI recommend "lxml" as mentioned in alternative answers (much more robust than the default one (html.parser) (i.e. available without additional install).

from bs4 import BeautifulSoupcleantext = BeautifulSoup(raw_html, "lxml").text

But it doesn't prevent you from using external libraries, so I recommend the first solution.

EDIT: To use lxml you need to pip install lxml.

python html xml string parsing

Python has several XML modules built in. The simplest one for the case that you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:

def remove_tags(text):    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

python html xml string parsing

Note that this isn't perfect, since if you had something like, say, <a title=">"> it would break. However, it's about the closest you'd get in non-library Python without a really complex function:

import reTAG_RE = re.compile(r'<[^>]+>')def remove_tags(text):    return TAG_RE.sub('', text)

However, as lvc mentions xml.etree is available in the Python Standard Library, so you could probably just adapt it to serve like your existing lxml version:

def remove_tags(text):    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

CodeHunter

Python code to remove HTML tags from a string [duplicate]

Using a regex

Using BeautifulSoup

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last