Web scraping with Python [closed]
Use urllib2 in combination with the brilliant BeautifulSoup library:
    import urllib2
    from BeautifulSoup import BeautifulSoup
    # or if you're using BeautifulSoup4:
    # from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

    for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
        tds = row('td')
        print tds[0].string, tds[1].string  # will print date and sunrise
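The snippet above is Python 2 (urllib2 and the print statement). On Python 3 the same idea might look roughly like the sketch below, assuming BeautifulSoup 4 is installed. The function names fetch_html and first_two_cells and the SAMPLE markup are made up for illustration; the real page's structure may differ.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def first_two_cells(html, table_class="spad"):
    """Return (first, second) cell text for each data row of the first matching table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) >= 2:  # skips header rows, which only contain <th> cells
            rows.append((tds[0].get_text(strip=True), tds[1].get_text(strip=True)))
    return rows


# Invented sample markup standing in for the downloaded page (an assumption).
SAMPLE = """
<table class="spad"><tbody>
<tr><th>Date</th><th>Sunrise</th></tr>
<tr><td>1 Jan</td><td>08:39</td></tr>
<tr><td>2 Jan</td><td>08:39</td></tr>
</tbody></table>
"""

for date, sunrise in first_two_cells(SAMPLE):
    print(date, sunrise)
```

For a live page you would pass first_two_cells(fetch_html('http://example.com')) instead of the sample string.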
I'd really recommend Scrapy.
Quote from a deleted answer:
- Scrapy crawls faster than mechanize because it uses asynchronous operations (on top of Twisted).
- Scrapy has better and faster support for parsing (X)HTML, built on top of libxml2.
- Scrapy is a mature framework with full Unicode support; it handles redirects, gzipped responses, and odd encodings, and has an integrated HTTP cache, etc.
- Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails, and exports the extracted data directly to CSV or JSON.
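To give a feel for why asynchronous crawling helps, here is a rough sketch. It is not Scrapy code and does not use Twisted: it uses the standard library's asyncio instead, and fake_fetch is a hypothetical stand-in that just sleeps rather than making a real HTTP request. The URLs are placeholders.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for an HTTP request: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return "<html>%s</html>" % url


async def crawl_sequential(urls):
    """Fetch pages one after another; total time is roughly the sum of the delays."""
    return [await fake_fetch(u) for u in urls]


async def crawl_concurrent(urls):
    """Start all fetches at once; total time is roughly the longest single delay."""
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


# Placeholder URLs (assumptions); nothing is actually downloaded.
URLS = ["http://example.com/page%d" % i for i in range(5)]

pages_seq, t_seq = timed(crawl_sequential(URLS))
pages_con, t_con = timed(crawl_concurrent(URLS))
print("sequential: %.2fs, concurrent: %.2fs" % (t_seq, t_con))
```

While one request is waiting on the network, the others make progress, so the concurrent version finishes in about the time of a single request.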
I collected the scripts from my web scraping work into this Bitbucket library.
Example script for your case:
    from webscraping import download, xpath
    D = download.Download()

    html = D.get('http://example.com')
    for row in xpath.search(html, '//table[@class="spad"]/tbody/tr'):
        cols = xpath.search(row, '/td')
        print 'Sunrise: %s, Sunset: %s' % (cols[1], cols[2])
Output:
    Sunrise: 08:39, Sunset: 16:08
    Sunrise: 08:39, Sunset: 16:09
    Sunrise: 08:39, Sunset: 16:10
    Sunrise: 08:40, Sunset: 16:10
    Sunrise: 08:40, Sunset: 16:11
    Sunrise: 08:40, Sunset: 16:12
    Sunrise: 08:40, Sunset: 16:13
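If you just want to try the XPath approach without installing anything, something similar can be sketched with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. This is a swap, not the webscraping package's API, and it assumes the markup is well-formed XML, which real-world HTML often is not. The SAMPLE table (including the dates) and the sun_times name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (an assumption about the page layout:
# column 0 is the date, column 1 the sunrise, column 2 the sunset).
SAMPLE = """<html><body>
<table class="spad"><tbody>
<tr><td>1 Jan</td><td>08:39</td><td>16:08</td></tr>
<tr><td>2 Jan</td><td>08:39</td><td>16:09</td></tr>
</tbody></table>
</body></html>"""


def sun_times(markup):
    """Return one formatted line per row of the table with class 'spad'."""
    root = ET.fromstring(markup)
    lines = []
    # ElementTree paths are relative to the root element, hence the leading '.'.
    for row in root.findall(".//table[@class='spad']/tbody/tr"):
        cols = [td.text for td in row.findall("td")]
        lines.append("Sunrise: %s, Sunset: %s" % (cols[1], cols[2]))
    return lines


for line in sun_times(SAMPLE):
    print(line)
```

For messy real pages, a forgiving HTML parser is usually a safer choice than a strict XML one.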