How can I get a Wikipedia article's text using Python 3 with Beautiful Soup?
There is a much easier way to get information from Wikipedia: the Wikipedia API.
There is a Python wrapper for it, Wikipedia-API (imported as wikipediaapi), which lets you do this in a few lines with no HTML parsing at all:
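import wikipediaapi

# Note: recent Wikipedia-API releases require a user agent as the first
# argument, e.g. wikipediaapi.Wikipedia('MyApp (me@example.com)', 'en').
wiki_wiki = wikipediaapi.Wikipedia('en')
page = wiki_wiki.page('Mathematics')
print(page.summary)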
Prints:
Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change... (omitted intentionally)
And, in general, try to avoid screen-scraping if there's a direct API available.
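If you would rather not install a wrapper at all, the same idea can be sketched with only the standard library by calling the MediaWiki API's extracts query directly. This is a minimal sketch, not the wrapper's own implementation: the helper names (build_extract_url, extract_text, fetch_article_text) and the User-Agent string are my own placeholders, and it assumes the response has the usual query → pages → extract shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, intro_only=False):
    """Build a MediaWiki API URL asking for a page's plain-text extract."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "redirects": 1,    # follow redirects to the target article
        "format": "json",
        "titles": title,
    }
    if intro_only:
        params["exintro"] = 1  # only the text before the first section
    return API_URL + "?" + urlencode(params)


def extract_text(payload):
    """Return the extract of the first page in a decoded API response."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract", "")
    return ""


def fetch_article_text(title, intro_only=False):
    """Download and return the article text (needs network access)."""
    request = Request(
        build_extract_url(title, intro_only),
        # Placeholder user agent; replace it with your own contact details.
        headers={"User-Agent": "wiki-text-example/0.1 (placeholder contact)"},
    )
    with urlopen(request, timeout=10) as response:
        return extract_text(json.load(response))


# Example call (performs a real HTTP request):
# print(fetch_article_text("Mathematics", intro_only=True))
```

Keeping the URL building and the response parsing in separate functions means both can be checked without touching the network; only fetch_article_text performs a request.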
Select the <p> tags. There are 52 of them on this page. I'm not sure whether you want the whole article, but you can iterate over those tags and store the text however you like. Here I just print each one to show the output.
import bs4
import requests

response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

# requests.get() never returns None, so check the HTTP status instead
if response.ok:
    html = bs4.BeautifulSoup(response.text, 'html.parser')
    title = html.select("#firstHeading")[0].text
    paragraphs = html.select("p")

    # print every paragraph of the article
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)
Use the wikipedia library:
import wikipedia

#print(wikipedia.summary("Mathematics"))
#wikipedia.search("Mathematics")
print(wikipedia.page("Mathematics").content)