Web scraping - how to identify main content on a webpage

python web-scraping html-parsing webpage

There are a number of ways to do it, but, none will always work. Here are the two easiest:

if it's a known finite set of websites: in your scraper convert each url from the normal url to the print url for a given site (cannot really be generalized across sites)
Use the arc90 readability algorithm (reference implementation is in javascript) http://code.google.com/p/arc90labs-readability/ . The short version of this algorithm is it looks for divs with p tags within them. It will not work for some websites but is generally pretty good.

python web-scraping html-parsing webpage

A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but works generally well for news sites, where the article is generally the biggest grouping of text, even if broken up into multiple div/p tags.

You'd use the script like: python webarticle2text.py <url>

python web-scraping html-parsing webpage

There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.

CodeHunter

Web scraping - how to identify main content on a webpage

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last