How can I get href links from HTML using Python?


Try Beautiful Soup:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

If you only want links starting with http://, use:

soup.findAll('a', attrs={'href': re.compile("^http://")})

In Python 3 with BS4 it should be:

from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
# find_all is the BS4 spelling; findAll still works as an older alias.
for link in soup.find_all('a'):
    print(link.get('href'))
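
As a rough sketch, the http:// filter and the Python 3 / BS4 version can be combined as below. The inline `sample_html` string and the `absolute_links` name are made up for illustration, and the pattern is widened to `^https?://` on the assumption that https links are wanted too.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of fetching a live page.
sample_html = """
<a href="http://example.com/a">A</a>
<a href="https://example.com/b">B</a>
<a href="/relative">C</a>
<a name="no-href">D</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Passing a compiled regex as the href filter keeps only <a> tags whose
# href matches it; tags with no href at all are skipped.
absolute_links = [
    a["href"] for a in soup.find_all("a", href=re.compile(r"^https?://"))
]
print(absolute_links)
```

Relative links such as /relative are dropped by this filter, which may or may not be what you want.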


You can use the HTMLParser module.

The code would probably look something like this:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes.
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print name, "=", value

parser = MyHTMLParser()
parser.feed(your_html_string)

Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
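
For Python 3, a minimal sketch of the same idea using html.parser might look like this. The class name `HrefCollector` and the `sample_html` string are placeholders, and the hrefs are collected into a list rather than printed one by one.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only look at anchor tags.
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


# Hypothetical input; replace with your own HTML string.
sample_html = (
    '<p><a href="/docs">Docs</a> and '
    '<a href="http://example.com">site</a> '
    '<a name="x">anchor</a></p>'
)

collector = HrefCollector()
collector.feed(sample_html)
collector.close()
print(collector.hrefs)
```

This needs nothing outside the standard library, which can be handy when installing Beautiful Soup is not an option.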


Look at the Beautiful Soup HTML parsing library.

http://www.crummy.com/software/BeautifulSoup/

You will do something like this:

import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")