Web scraping with Python [closed]


Use urllib2 in combination with the brilliant BeautifulSoup library:

    import urllib2
    from BeautifulSoup import BeautifulSoup
    # or if you're using BeautifulSoup4:
    # from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

    for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
        tds = row('td')
        print tds[0].string, tds[1].string
        # will print date and sunrise
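The snippet above is Python 2 (urllib2 and the print statement). A rough Python 3 equivalent might look like the sketch below, assuming the beautifulsoup4 package is installed. The SAMPLE_HTML markup and the helper names fetch and extract_rows are made up for illustration, so the example can run without touching the network; the real page layout may differ.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Made-up sample markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<table class="spad">
  <tbody>
    <tr><td>1 Jan</td><td>08:39</td><td>16:08</td></tr>
    <tr><td>2 Jan</td><td>08:39</td><td>16:09</td></tr>
  </tbody>
</table>
"""


def fetch(url):
    """Download a page and return its raw bytes (hypothetical helper, not called below)."""
    with urlopen(url) as response:
        return response.read()


def extract_rows(html):
    """Return (date, sunrise) pairs from the first table with class 'spad'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="spad")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:  # skip header rows and rows without enough cells
            rows.append((cells[0], cells[1]))
    return rows


# Parse the sample markup; for a live page, pass fetch(url) instead.
for date, sunrise in extract_rows(SAMPLE_HTML):
    print(date, sunrise)
```

Searching for tr elements anywhere inside the table, rather than going through tbody, is a deliberate choice here: the built-in html.parser backend does not insert a missing tbody element, so the lookup still works when the page omits it.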


I'd really recommend Scrapy.

Quote from a deleted answer:

  • Scrapy crawls faster than mechanize because it uses asynchronous operations (on top of Twisted).
  • Scrapy has better and faster support for parsing (X)HTML, built on top of libxml2.
  • Scrapy is a mature framework with full Unicode support; it handles redirects, gzipped responses, and odd encodings, and has an integrated HTTP cache, etc.
  • Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails, and exports the extracted data directly to CSV or JSON.


I collected scripts from my web scraping work into this Bitbucket library.

Example script for your case:

    from webscraping import download, xpath

    D = download.Download()
    html = D.get('http://example.com')

    for row in xpath.search(html, '//table[@class="spad"]/tbody/tr'):
        cols = xpath.search(row, '/td')
        print 'Sunrise: %s, Sunset: %s' % (cols[1], cols[2])

Output:

    Sunrise: 08:39, Sunset: 16:08
    Sunrise: 08:39, Sunset: 16:09
    Sunrise: 08:39, Sunset: 16:10
    Sunrise: 08:40, Sunset: 16:10
    Sunrise: 08:40, Sunset: 16:11
    Sunrise: 08:40, Sunset: 16:12
    Sunrise: 08:40, Sunset: 16:13
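If the webscraping library is not available, roughly the same XPath query can be run with lxml instead. This is a swapped-in technique rather than the library's own API, and it assumes the lxml package is installed. The sample markup and the sun_times helper are made up for illustration; the date cells in particular are placeholder values.

```python
from lxml import html as lxml_html  # third-party: pip install lxml

# Made-up sample markup (an assumption); the date values are placeholders.
SAMPLE_HTML = """
<html><body>
<table class="spad">
  <tbody>
    <tr><td>1 Jan</td><td>08:39</td><td>16:08</td></tr>
    <tr><td>2 Jan</td><td>08:39</td><td>16:09</td></tr>
  </tbody>
</table>
</body></html>
"""


def sun_times(page):
    """Return (sunrise, sunset) pairs from the rows of the 'spad' table."""
    tree = lxml_html.fromstring(page)
    results = []
    for row in tree.xpath('//table[@class="spad"]/tbody/tr'):
        cols = [td.text_content().strip() for td in row.xpath("./td")]
        if len(cols) >= 3:  # expect date, sunrise, sunset
            results.append((cols[1], cols[2]))
    return results


for sunrise, sunset in sun_times(SAMPLE_HTML):
    print("Sunrise: %s, Sunset: %s" % (sunrise, sunset))
```

Note that the query only matches when the page source itself contains a tbody element, since lxml does not add one the way browsers do; dropping /tbody from the path and using //tr is a more forgiving variant.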