Summarizing a Wikipedia Article


Since your question is more of a research topic than a programming problem, you should probably look at the scientific literature. There you will find published details of a number of algorithms that do exactly what you want. A Google search for "keyword summarization" turns up the following:

Single document Summarization based on Clustering Coefficient and Transitivity Analysis

Multi-document Summarization for Query Answering E-learning System

Intelligent Email: Aiding Users with AI

If you read the papers above and then follow the references they contain, you will find a wealth of information, certainly enough to build a functional application.


Just my two cents...

Whenever I'm browsing a new subject on Wikipedia, I typically perform a "breadth-first" search: I refuse to move on to another topic until I've scanned every link on the page that introduces a topic I'm not already familiar with. For each linked article, I read the first sentence of each paragraph, and if something in that article appears to relate to the original topic, I repeat the process there.
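The browsing habit described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `links` is a hypothetical in-memory dict mapping a page title to the titles it links to (a real tool would have to fetch and parse pages), `known` stands in for the topics the reader is already familiar with, and the "does this relate to the original topic" check is left out entirely.

```python
from collections import deque


def reading_order(start, links, known=frozenset(), max_pages=50):
    """Return page titles in the breadth-first order they would be read.

    links: hypothetical dict mapping a page title to the titles it links to.
    known: topics the reader already knows; these are never queued.
    max_pages: safety cap so the crawl cannot run forever.
    """
    seen = {start} | set(known)
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # unfamiliar and not queued yet
                seen.add(target)
                queue.append(target)
    return order


# Toy link graph; the titles are placeholders, not real articles.
toy_links = {"A": ["B", "C"], "B": ["D", "A"], "C": ["D", "E"], "D": []}
print(reading_order("A", toy_links))
print(reading_order("A", toy_links, known={"C"}))
```

Every link on the current page is queued before any of them is followed, which is what makes the order breadth-first rather than depth-first.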

If I were to design the interface for a Wikipedia "summarizer", I would:

  1. Always print the entire introductory paragraph.

  2. For the rest of the article, print any sentence that has a link in it.

    2a. Print any comma-separated list of links as a bulleted list.

  3. If a link to an article is "expanded", print the first paragraph of that article.

  4. If that introductory paragraph is itself expanded, repeat the listing of sentences with links for that article.

This process could repeat indefinitely.
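A minimal sketch of the steps above, under several assumptions that are mine and not part of the proposal: articles are held in a hypothetical `articles` dict as wikitext-style strings with `[[Target]]` or `[[Target|label]]` links, paragraphs are separated by blank lines, sentences are split naively on end punctuation, and the "expanded" state is passed in as two sets (`open_links` for step 3, `open_intros` for step 4). A `max_depth` cap keeps the recursion from repeating indefinitely.

```python
import re

# Wikitext-style link: [[Target]] or [[Target|label]]; group 1 is the target.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
# Two or more links separated only by commas.
LINK_LIST = re.compile(r"\[\[[^\]]+\]\](?:\s*,\s*\[\[[^\]]+\]\])+")


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def summarize(title, articles, open_links=frozenset(), open_intros=frozenset(),
              depth=0, max_depth=3):
    """Return the summary of `title` as a list of indented lines."""
    pad = "  " * depth
    paragraphs = [p.strip() for p in articles[title].split("\n\n") if p.strip()]
    lines = [pad + paragraphs[0]]  # steps 1 and 3: the whole intro paragraph
    # A linked article shows only its intro unless that intro was expanded too.
    if depth >= max_depth or (depth > 0 and title not in open_intros):
        return lines
    for paragraph in paragraphs[1:]:
        for sentence in split_sentences(paragraph):
            targets = LINK.findall(sentence)
            if not targets:
                continue  # step 2: drop sentences without links
            lines.append(pad + sentence)
            for run in LINK_LIST.findall(sentence):  # step 2a: bullet the list
                lines.extend(f"{pad}  * {t}" for t in LINK.findall(run))
            for target in targets:  # steps 3 and 4: recurse into expanded links
                if target in open_links and target in articles:
                    lines.extend(summarize(target, articles, open_links,
                                           open_intros, depth + 1, max_depth))
    return lines


# Toy data standing in for fetched articles.
toy_articles = {
    "Graph": ("A graph is a set of vertices joined by edges.\n\n"
              "Graphs appear in many fields. "
              "The basic parts are [[Vertex]], [[Edge]]."),
    "Vertex": ("A vertex is a point in a graph.\n\n"
               "Vertices are joined by an [[Edge]]."),
    "Edge": "An edge joins two vertices.",
}
for line in summarize("Graph", toy_articles, open_links={"Vertex"}):
    print(line)
```

Keeping the expanded state outside the function means an interface can toggle a link or an intro and simply call `summarize` again; the sentences are printed with their raw link markup, which a real interface would render as clickable links.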

What I'm saying is that summarizing a Wikipedia article isn't the same as summarizing a magazine article or a blog post. Crawling from link to link is an important part of learning introductory concepts quickly via Wikipedia, and I feel that's for the best. Typically, the bottom half of an article is where the "citation needed" tags start popping up, while the first half of any given article is treated as accepted knowledge by the community.