Converting html to text with Python Converting html to text with Python python python

Converting html to text with Python


soup.get_text() outputs what you want:

from bs4 import BeautifulSoupsoup = BeautifulSoup(html)print(soup.get_text())

output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massaConsectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massaAenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massaLorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massaConsectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

To keep newlines:

print(soup.get_text('\n'))

To be identical to your example, you can replace a newline with two newlines:

soup.get_text().replace('\n','\n\n')


It's possible using python standard html.parser:

from html.parser import HTMLParserclass HTMLFilter(HTMLParser):    text = ""    def handle_data(self, data):        self.text += dataf = HTMLFilter()f.feed(data)print(f.text)


You can use a regular expression, but it's not recommended. The following code removes all the HTML tags in your data, giving you the text:

import redata = """<div class="body"><p><strong></strong></p><p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p><p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p><p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p><p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p><p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""data = re.sub(r'<.*?>', '', data)print(data)

Output

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massaConsectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massaAenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massaLorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massaConsectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa