How can I get a Wikipedia article's text using Python 3 with Beautiful Soup?

python html web-scraping beautifulsoup wikipedia

There is a much easier way to get information from Wikipedia: the Wikipedia API.

There is a Python wrapper for it, wikipediaapi, which lets you do this in a few lines with zero HTML parsing:

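import wikipediaapi

# Note: recent releases of Wikipedia-API also expect a user_agent argument;
# check the signature for the version you have installed.
wiki_wiki = wikipediaapi.Wikipedia('en')
page = wiki_wiki.page('Mathematics')
print(page.summary)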

Prints:

Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change ... (omitted intentionally)

And, in general, try to avoid screen-scraping if there's a direct API available.
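If you would rather not install a wrapper, the same idea can be sketched against the MediaWiki API directly. This is only a minimal sketch, not how the wrapper itself works: it assumes the English Wikipedia endpoint and the TextExtracts query parameters (prop=extracts, explaintext, exintro), and the helper names build_extract_url and extract_text are made up for illustration. It builds the URL and parses a hand-written response so it runs offline; the actual download is left as a comment.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, intro_only=True):
    """Build a query URL asking for a page's plain-text extract (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    if intro_only:
        params["exintro"] = 1  # only the text before the first section heading
    return API_URL + "?" + urlencode(params)


def extract_text(response_json):
    """Pull the extract out of a decoded response (hypothetical helper).

    Assumes the default JSON layout, where pages are keyed by page id.
    """
    pages = response_json.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


# A hand-written response in the assumed shape, so the sketch runs offline.
sample = json.loads(
    '{"query": {"pages": {"1": {"pageid": 1, "title": "Mathematics", '
    '"extract": "Placeholder extract text."}}}}'
)
print(build_extract_url("Mathematics"))
print(extract_text(sample))

# With network access, something like this would fetch real data:
#   import requests
#   data = requests.get(build_extract_url("Mathematics")).json()
#   print(extract_text(data))
```

Keeping the URL building and the response parsing in separate functions means each piece can be checked without touching the network.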

Select the <p> tags. There are 52 of them on this page. I'm not sure whether you want the whole thing, but you can iterate through those tags and store the text however you like. I just chose to print each of them to show the output.

import bs4
import requests

response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

# requests.get never returns None, so check the HTTP status instead
if response.ok:
    html = bs4.BeautifulSoup(response.text, 'html.parser')
    title = html.select("#firstHeading")[0].text
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)
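The slice paragraphs[0:5] above hard-codes how many paragraphs make up the introduction. One possible alternative, sketched here under assumptions, is to walk the paragraphs and headings in document order and stop at the first <h2>. The HTML below is a tiny made-up stand-in for a real article page, and the function name intro_paragraphs is hypothetical; real Wikipedia markup is more involved and may change.

```python
import bs4

# Made-up sample markup standing in for a downloaded article page.
SAMPLE_HTML = """
<html><body>
<h1 id="firstHeading">Example</h1>
<div id="content">
<p>First intro paragraph.</p>
<p>Second intro paragraph.</p>
<h2>History</h2>
<p>Paragraph in a later section.</p>
</div>
</body></html>
"""


def intro_paragraphs(html_text):
    """Return the text of every <p> that appears before the first <h2>."""
    soup = bs4.BeautifulSoup(html_text, "html.parser")
    intro = []
    # find_all with a list of names yields matches in document order.
    for tag in soup.find_all(["p", "h2"]):
        if tag.name == "h2":
            break
        text = tag.get_text().strip()
        if text:  # skip empty paragraphs
            intro.append(text)
    return intro


print("\n".join(intro_paragraphs(SAMPLE_HTML)))
```

To try it on a live page, pass response.text from the request above instead of SAMPLE_HTML.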

Use the wikipedia library:

import wikipedia

#print(wikipedia.summary("Mathematics"))
#wikipedia.search("Mathematics")
print(wikipedia.page("Mathematics").content)
