How can I turn <br> and <p> into line breaks?


Without some specifics, it's hard to be sure this does exactly what you want, but it should give you the idea. It assumes your <br> tags are wrapped inside <p> elements.

    # Python 2 with BeautifulSoup 3
    from BeautifulSoup import BeautifulSoup
    import types

    def replace_with_newlines(element):
        # Walk every descendant: keep the stripped text, turn each <br> into '\n'
        text = ''
        for elem in element.recursiveChildGenerator():
            if isinstance(elem, types.StringTypes):
                text += elem.strip()
            elif elem.name == 'br':
                text += '\n'
        return text

    page = """<html><body><p>America,<br>Now is the<br>time for all good men to come to the aid<br>of their country.</p><p>pile on taxpayer debt<br></p><p>Now is the<br>time for all good men to come to the aid<br>of their country.</p></body></html>"""

    soup = BeautifulSoup(page)
    lines = soup.find("body")
    for line in lines.findAll('p'):
        line = replace_with_newlines(line)
        print line

Running this results in...

    (py26_default)[mpenning@Bucksnort ~]$ python thing.py
    America,
    Now is the
    time for all good men to come to the aid
    of their country.
    pile on taxpayer debt

    Now is the
    time for all good men to come to the aid
    of their country.
    (py26_default)[mpenning@Bucksnort ~]$


get_text seems to do what you need:

    >>> from bs4 import BeautifulSoup
    >>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
    >>> soup = BeautifulSoup(doc)
    >>> soup.get_text(separator="\n")
    u'This is a paragraph.\nThis is another paragraph.'
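
As a rough sketch of the same idea on markup that also contains <br> tags: each <br> splits the surrounding text into separate strings, so get_text with a separator should put a line break there as well as between paragraphs. The helper name text_with_breaks and the sample markup are made up for illustration, and "html.parser" is just the parser chosen here so that no extra parser library is needed. strip=True trims each string and skips the empty ones.

```python
from bs4 import BeautifulSoup


def text_with_breaks(html):
    """Return the text of `html` with one line per text fragment.

    Hypothetical helper, not part of bs4: it only wraps get_text().
    """
    # "html.parser" is an assumption; any parser bs4 supports should work.
    soup = BeautifulSoup(html, "html.parser")
    # Every string between tags (including around <br>) becomes its own line.
    return soup.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    sample = "<p>America,<br>Now is the<br>time</p><p>pile on debt<br></p>"
    print(text_with_breaks(sample))
```

Note that this puts a line break at every tag boundary, so inline tags such as <b> or <a> inside a sentence would also start a new line.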


This is a Python 3 version of @Mike Pennington's answer (it really helps); I did a little refactoring.

    def replace_with_newlines(element):
        text = ''
        for elem in element.recursiveChildGenerator():
            if isinstance(elem, str):
                text += elem.strip()
            elif elem.name == 'br':
                text += '\n'
        return text

    def get_plain_text(soup):
        plain_text = ''
        lines = soup.find("body")
        for line in lines.findAll('p'):
            line = replace_with_newlines(line)
            plain_text += line
        return plain_text

To use this, just pass the BeautifulSoup object to the get_plain_text function.

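    from bs4 import BeautifulSoup

    # page is an HTML string, e.g. the one from Mike Pennington's answer
    soup = BeautifulSoup(page)
    plain_text = get_plain_text(soup)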
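
If installing BeautifulSoup is not an option, roughly the same thing can be sketched with only the standard library's html.parser. This swaps in a different technique from the answers above: instead of walking a parsed tree, it reacts to tags as they stream past, emitting a newline for each <br> and at the end of each <p>. The names LineBreakExtractor and html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class LineBreakExtractor(HTMLParser):
    """Collect text, adding '\n' for each <br> and at the end of each <p>.

    Hypothetical class for illustration; only HTMLParser itself is stdlib.
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> has no closing tag, so the break is emitted when it opens.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Closing a paragraph ends the current line.
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        # Same stripping as replace_with_newlines in the answers above.
        self.parts.append(data.strip())


def html_to_text(html):
    parser = LineBreakExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(html_to_text("<p>one<br>two</p><p>three</p>"))
```

Because every text fragment is stripped, spaces next to inline tags are lost here too, so treat this as a starting point rather than a general HTML-to-text converter.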