BeautifulSoup: extract text from anchor tag
Loop over each div and read the link's href, the anchor text, and the image src:
    from bs4 import BeautifulSoup

    data = '''
    <div class="image">
      <a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
    </div>
    <div class="image">
      <a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
    </div>
    '''

    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(data, 'html.parser')

    for div in soup.find_all('div', attrs={'class': 'image'}):
        print(div.find('a')['href'])        # link target
        print(div.find('a').contents[0])    # first child of the anchor: its text
        print(div.find('img')['src'])       # image URL
If you are looking into Amazon products then you should be using the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.
In my case, this worked:
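    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    url = "http://blabla.com"
    soup = BeautifulSoup(urlopen(url), 'html.parser')

    # link.string is None when an anchor has more than one child (e.g. text plus an <img>)
    for link in soup.find_all('a'):
        print(link.string)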
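One caveat with the loop above: `.string` only returns text when the tag has a single child. For an anchor that wraps both text and an image, `get_text()` is usually the safer call. A minimal sketch, assuming bs4 is installed; the sample markup and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: an anchor holding both text and an <img>.
html = (
    '<a href="http://www.example.com/eg1">Content1'
    '<img src="http://image.example.com/img1.jpg" /></a>'
)

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# .string is None here because the anchor has two children (text and <img>).
string_value = link.string

# get_text() joins every text node inside the tag; strip=True trims whitespace.
text_value = link.get_text(strip=True)

print(string_value)
print(text_value)
```

If the anchors on your page contain nothing but text, `.string` and `get_text()` give the same result.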
Hope it helps!
I would suggest going the lxml route and using XPath.
    from lxml import etree

    # data is the variable containing the HTML
    data = etree.HTML(data)
    anchor = data.xpath('//a[@class="title"]/text()')
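Here is a self-contained version of the same idea that you can run as-is. It is only a sketch, assuming lxml is installed; the sample markup is invented, with one anchor carrying the `title` class that the XPath expression looks for:

```python
from lxml import etree

# Hypothetical sample markup; only the first anchor has class="title".
html = '''
<div><a class="title" href="http://www.example.com/eg1">Content1</a></div>
<div><a class="other" href="http://www.example.com/eg2">Content2</a></div>
'''

tree = etree.HTML(html)

# text() selects the text nodes that are direct children of each matching <a>.
titles = tree.xpath('//a[@class="title"]/text()')

# @href selects the href attribute of the same anchors.
hrefs = tree.xpath('//a[@class="title"]/@href')

print(titles)
print(hrefs)
```

Note that `@class="title"` compares the whole attribute value, so an anchor with `class="title big"` would not match this expression.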